Introduction to the practical application of code understanding technology

Author | CQT&Nebula Team

1. Introduction

As one of the key technologies behind the software knowledge graph, code understanding provides foundational techniques and data for build, testing, fault localization, code interpretation, and more. It is also the starting point of continuous integration: only by understanding the code can we build purposefully and effectively. Code understanding matters greatly to the success of software development and the efficiency of maintenance, and it is key to improving software quality, reducing development costs, and raising development efficiency.

2. What is code understanding?

Code understanding is a technical field that takes software systems as its object of analysis, examining their internal information and operational processes to obtain relevant knowledge. This information can be accessed and applied at every stage of CI&CD.

Three analysis methods are commonly used for code understanding: static analysis, dynamic analysis, and non-source-code analysis. With the advent of the LLM era, we are also studying the breakthroughs and applications of large models in the field of code understanding.

Static analysis : scanning program code without running it, using lexical analysis, syntax analysis, control-flow and data-flow analysis, and other techniques to verify whether the code meets standards for compliance, security, reliability, maintainability, and other indicators.

Dynamic analysis : a technique for analyzing the behavior of a software system before, during, and after its execution in a simulated or real environment.

Non-source-code analysis : correlation analysis between non-source files, such as data files and configuration files, and the source code. When the code repository changes, the impact of the change on the source code and its functionality can then be detected.

LLM-based analysis : relying on the reasoning and deduction capabilities of large models to mine knowledge from a program's dynamic and static data.

3. The main role of code understanding

This article mainly introduces static code analysis in detail. Static code analysis is an important technology in code understanding and plays a role in every stage of CI&CD:

1. Static code analysis helps developers understand the structure, logic, and function of code. By analyzing the syntax, semantics, and behavior of the code, static analysis tools can discover potential problems and errors and suggest repairs and improvements, greatly improving the maintainability and reliability of the code.

2. By analyzing the logic and structure of the code, static code analysis can offer developers refactoring and optimization suggestions, helping them improve code quality and efficiency. This makes the code easier to maintain, extend, and update.

3. Static code analysis can also help developers detect security vulnerabilities. By analyzing the logic and data flow of the code, static analysis tools can spot potential vulnerabilities and attack paths and provide corresponding suggestions and repair plans, improving system security (a toy rule of this kind is sketched after this list).

4. Static code analysis can also support automated testing. By analyzing the behavior and functions of the code, static analysis tools can generate test cases and perform testing tasks automatically, improving testing efficiency and accuracy.

5. By understanding code, team members can collaborate better, because they know how to read and use each other's code.

6. By understanding the code, developers can more easily reuse it in other projects or systems, because they know how to modify and adapt it.
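To make point 3 concrete, here is a toy Go sketch of a security-oriented rule: it flags calls to exec.Command whose arguments are not constant string literals, a common command-injection pattern. Real tools perform full data-flow analysis; this purely syntactic check is only an illustration, and the example source is invented.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := `package demo
import "os/exec"
func run(userInput string) {
	exec.Command("sh", "-c", userInput) // should be flagged
	exec.Command("ls", "-l")            // literal args, fine
}`
	fset := token.NewFileSet()
	file, _ := parser.ParseFile(fset, "demo.go", src, 0) // error handling omitted
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		// Match calls of the form exec.Command(...).
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "Command" {
			return true
		}
		if id, ok := sel.X.(*ast.Ident); !ok || id.Name != "exec" {
			return true
		}
		// Any non-literal argument may carry attacker-controlled data.
		for _, arg := range call.Args {
			if _, isLit := arg.(*ast.BasicLit); !isLit {
				fmt.Printf("%s: non-literal argument to exec.Command\n",
					fset.Position(call.Pos()))
			}
		}
		return true
	})
}
```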

The main applications of static code understanding in CI&CD include:

1. Vulnerability detection and repair : static code understanding tools detect errors, potential problems, and other non-conformities through syntax analysis, semantic analysis, and related techniques, helping developers find and fix problems in a timely manner and improving code quality and reliability.

2. Code refactoring and optimization : static code understanding tools analyze the syntax and structure of the code without executing the program, providing refactoring and optimization suggestions that help developers improve code structure and efficiency.

3. Code quality assessment : static code understanding tools can perform a comprehensive quality assessment of the code, covering readability, maintainability, scalability, and more, helping developers better understand the quality of their code and take appropriate measures to improve it.

4. Automated testing : static code understanding tools can turn manual steps in the traditional development process into automated ones, such as automated build, testing, and deployment. This can greatly shorten the development cycle and improve development efficiency and quality.

4. Introduction to typical technical solutions

Traditional code understanding solutions consist of three parts: a code parsing layer, a code analysis layer, and an application layer. As shown in the figure below, the overall process can be briefly described as: parse the source code, construct an abstract syntax tree (AST) or intermediate representation (IR), traverse it to extract various code features, and finally generate a code feature file.

[Figure: traditional code understanding pipeline]

1. Source code parsing : the first step in building an AST or IR, which breaks the source code down into smaller parts (e.g., tokens, symbols, expressions). This step is usually performed by a lexical analyzer (also called a scanner or lexer).

2. Build the AST/IR : once the source code is broken into smaller parts, those parts are combined into an abstract syntax tree (AST) or an intermediate representation (IR). Both are structured representations of source code, but they differ in granularity and complexity: the AST stays close to the syntax of the source, while the IR is closer to machine code. Building the AST or IR is usually performed by a parser.

3. Extract features : traverse the nodes of the AST or IR. During traversal, various features of the code can be extracted, such as variables, functions, control-flow structures, and data dependencies. Specific code patterns can also be identified, code metrics computed (for example, complexity or duplicated code), and other information about the code gathered.

4. Generate and apply feature files : store the extracted code features in feature files (for example, JSON, XML, or a database) that applications can consume in various analysis scenarios. A minimal Go sketch of these four steps follows this list.
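Below is a minimal Go sketch of the four steps above, using Go's native go/parser and go/ast (the same front end this article later mentions for Go). The feature fields (name, parameter count, a rough branch-count complexity) are illustrative choices, not the actual schema of any tool described here.

```go
package main

import (
	"encoding/json"
	"go/ast"
	"go/parser"
	"go/token"
	"os"
)

// FuncFeature is a hypothetical per-function feature record, not a real schema.
type FuncFeature struct {
	Name       string `json:"name"`
	Params     int    `json:"params"`
	Complexity int    `json:"complexity"` // 1 + number of branch nodes
}

func main() {
	src := `package demo
func Classify(n int) string {
	if n < 0 { return "neg" }
	for i := 0; i < n; i++ { _ = i }
	return "pos"
}`
	// Steps 1-2: lexing and parsing the source into an AST.
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}
	// Step 3: traverse the AST and extract per-function features.
	var feats []FuncFeature
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue
		}
		f := FuncFeature{Name: fn.Name.Name, Params: fn.Type.Params.NumFields(), Complexity: 1}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			switch n.(type) {
			case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt, *ast.CaseClause:
				f.Complexity++ // count branch points as a crude complexity metric
			}
			return true
		})
		feats = append(feats, f)
	}
	// Step 4: emit the extracted features as a JSON feature file (stdout here).
	json.NewEncoder(os.Stdout).Encode(feats)
}
```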

However, traditional code understanding solutions face many challenges in practice, and their implementation meets great resistance. From the parsing-infrastructure perspective, they demand deep expertise in code parsing technology and in selecting and adapting open-source parsers for each language, along with high requirements for storage scalability and execution efficiency. At the analysis layer, traversing the AST demands strong analysis skills, the analysis capabilities cannot be reused, and diverse scenarios lead to much repetitive development. From the data consumer's perspective, the adoption threshold is high: code feature files differ in format across languages and require repeated adaptation, and the solutions are not very open to applications, so supporting extended scenarios carries high development costs.

5. Our technical solutions

To solve the traditional solutions' problems of a high adoption threshold, high code analysis cost, and steep expertise requirements, we propose a new approach: redefine the code parsing service, decouple parsing from analysis, and build a three-tier code understanding service.

[Figure: three-tier architecture of the code understanding service]

5.1 Base layer

The base layer addresses the steep expertise requirements of code parsing by building efficient, easily extensible, multi-language compiler-front-end data. It mainly consists of the following parts:

1. Multi-language parsers. Choose an appropriate parsing solution for each language, such as cppcheck 2.5 for C++ and the native go/ast for Go. Each parser must deliver excellent performance, with parsing throughput reaching 2 million+ lines of code per hour. The parser turns code into data at multiple granularities, such as tokens, AST, and symbols, for use by the analysis layer above.

2. Data storage. Establish a universal storage schema standard across languages, with room for per-language extensions. Taking into account the volume of parsed code data, query requirements, and query speed, storage solutions such as flat files or neo4j can be adopted flexibly.

3. Caching mechanism. In practice, some large modules parsed slowly (over one hour), for two main reasons: ① many code files; ② repeated parsing. Efficiency is improved through multi-process concurrent parsing on the one hand, and through a caching mechanism on the other, which avoids repeated parsing so that complete front-end data can be obtained by parsing only part of the files (a minimal cache sketch follows the figure below).

[Figure: base layer of the code understanding service]
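A minimal sketch of the caching idea, under the assumption that parse results are keyed by a hash of file contents so unchanged files skip re-parsing. The ParsedData type and the parse callback are stand-ins for real front-end output, and the article's multi-process concurrency is approximated here by a mutex-guarded map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// ParsedData is a placeholder for the tokens/AST/symbols a parser produces.
type ParsedData struct{ Symbols []string }

// ParseCache maps content hashes to parse results.
type ParseCache struct {
	mu    sync.Mutex
	store map[string]ParsedData
}

func NewParseCache() *ParseCache {
	return &ParseCache{store: map[string]ParsedData{}}
}

// Parse returns cached front-end data when the file content is unchanged,
// invoking the (expensive) parser only on cache misses.
func (c *ParseCache) Parse(content []byte, parse func([]byte) ParsedData) ParsedData {
	sum := sha256.Sum256(content)
	key := hex.EncodeToString(sum[:])
	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.store[key]; ok {
		return d // cache hit: no re-parse
	}
	d := parse(content)
	c.store[key] = d
	return d
}

func main() {
	cache := NewParseCache()
	parse := func(b []byte) ParsedData { // stand-in for an expensive parser
		return ParsedData{Symbols: []string{string(b[:4])}}
	}
	src := []byte("func used() {}")
	_ = cache.Parse(src, parse) // miss: parses the file
	_ = cache.Parse(src, parse) // hit: returns the cached result
}
```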

5.2 Analysis layer

The analysis layer aims to describe the relationships of code inside and outside a module by abstracting general, professional analysis capabilities, solving the high-cost problem at the code analysis level. The main idea is to combine business scenarios, split the capability map, and gradually build the entire relationship network from the inside out.

[Figure: analysis layer capability map]

5.3 Service layer

The service layer aims to lower the barrier for data consumers by exposing the accumulated code knowledge as easy-to-use, open services: the data can be accessed in multiple ways through standard APIs, and new scenarios can be supported at low cost.

[Figure: service layer capabilities]

5.4 Technical effects

  • Explored a universal code understanding solution, built a white-box software knowledge graph, and implemented it for C++/Go

  • Basic capabilities: multi-language coverage, efficient and easy to extend

    • Supports 3 languages and 10+ code entity data sources

    • C/C++ efficiency breakthrough: parsing time cut by nearly 9x, with incremental parsing under 200s, a breakthrough level for practical adoption

    • A standards-consistent schema and a universal extraction framework, easy to extend

  • Analysis capabilities: diverse

    • 12 general analysis capabilities

    • 20+ common relationships established

  • External service capabilities: easy to use and open

    • 3 ways to access data

    • 200+ APIs

    • Business teams can complete basic white-box strategy development at low cost (< 1 hour)

Covering 2,000+ business code paths and accumulating 1TB+ of code knowledge data, the service has supported the incubation and implementation of 10+ large-scale applications, averaging 2.4k+ calls per day.

6. Typical applications of code understanding in Baidu

As mentioned above, code understanding is widely used in defect detection, intelligent build, code review (CR), fault localization, code repair, and more. Below we introduce the implementation of code understanding technology at Baidu through several practical scenarios.

6.1 Code understanding application scenario: intelligent UT

Baidu has migrated a large number of modules from other languages to Go, but the risk-recall tool chain for Go was blank, so the ability to proactively recall code risks had to be built. Intelligent UT is a common application for proactively recalling risky code. Unit testing exercises the smallest units of code. Traditional UT relies on developers manually writing unit-test code, which suffers from high development cost and missed boundary scenarios. Intelligent UT tools automatically construct test data and generate unit-test code by understanding the function under test and its test conditions; running the intelligent UT cases enables active recall of code issues.

【Solution】

For intelligent UT to understand the function under test and generate high-quality test cases, information such as control flow, call chains, and data flow must be obtained through code understanding, providing the analysis capability that lets intelligent UT understand the content of the function under test and its test conditions (a sketch of extracting branch conditions follows the figure below).

△ Figure: code understanding for intelligent UT
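As a hedged illustration of the "understand the function under test" step, the sketch below collects the if-conditions of a target Go function from its AST; a test generator could turn each condition into boundary test inputs. This is not Baidu's actual intelligent UT pipeline, just a minimal version of the control-flow extraction it relies on.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/printer"
	"go/token"
	"strings"
)

// branchConditions returns the source text of every if-condition in fn.
func branchConditions(fset *token.FileSet, fn *ast.FuncDecl) []string {
	var conds []string
	ast.Inspect(fn.Body, func(n ast.Node) bool {
		if ifStmt, ok := n.(*ast.IfStmt); ok {
			var b strings.Builder
			printer.Fprint(&b, fset, ifStmt.Cond) // render the condition expression
			conds = append(conds, b.String())
		}
		return true
	})
	return conds
}

func main() {
	src := `package demo
func Grade(score int) string {
	if score < 0 || score > 100 { return "invalid" }
	if score >= 60 { return "pass" }
	return "fail"
}`
	fset := token.NewFileSet()
	file, _ := parser.ParseFile(fset, "demo.go", src, 0) // error handling omitted
	for _, d := range file.Decls {
		if fn, ok := d.(*ast.FuncDecl); ok {
			// Each condition suggests at least two test inputs (true/false branch).
			fmt.Println(fn.Name.Name, "conditions:", branchConditions(fset, fn))
		}
	}
}
```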

【Effect & Benefit】

It supported the incubation and implementation of Go's intelligent UT capability. The generated UT recalled 400+ effective risk issues in a single quarter, and the accuracy of intelligent UT cases reaches 65%.

6.2 Code understanding application scenario: useless-function cleanup

In Baidu's information feed and search businesses, useless-function cleanup is a typical scenario in technical-debt management. Useless code increases the overhead of software development, testing, and troubleshooting; for example, QA and RD must spend extra effort evaluating the impact scope of requirements, so such code needs to be governed. Ideally, a function that is no longer called, i.e., one with zero in-degree in the call chain, can be considered useless (an "island function"). In real scenarios, however, some functions act as basic library functions for external callers, and others are invoked implicitly or sit on broken links; in these cases, a function with zero in-degree is not actually useless.

【Solution】

To identify island functions accurately, code understanding is used to filter out functions with zero in-degree and out-degree based on the call chain, and information such as classes, functions, and macro definitions is further combined to abstractly model the island-analysis capability, improving the accuracy of island-function identification (a minimal in-degree filtering sketch follows the figure below).

△ Figure: code understanding for island-function identification
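A minimal sketch of the zero-in-degree filtering step: build a rough static call graph over one file and report functions nobody calls, with entry points whitelisted. The extra filtering the article describes (exported library APIs, implicit calls, macro information) is omitted here.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := `package main
func used()   {}
func main()   { used() }
func island() {}`
	fset := token.NewFileSet()
	file, _ := parser.ParseFile(fset, "demo.go", src, 0) // error handling omitted

	// Initialize in-degree 0 for every declared function.
	inDegree := map[string]int{}
	for _, d := range file.Decls {
		if fn, ok := d.(*ast.FuncDecl); ok {
			inDegree[fn.Name.Name] = 0
		}
	}
	// Count direct calls to known functions (a crude call graph).
	ast.Inspect(file, func(n ast.Node) bool {
		if call, ok := n.(*ast.CallExpr); ok {
			if id, ok := call.Fun.(*ast.Ident); ok {
				if _, known := inDegree[id.Name]; known {
					inDegree[id.Name]++
				}
			}
		}
		return true
	})
	// Entry points are whitelisted; real tools also whitelist exported APIs
	// and implicitly-called functions.
	entry := map[string]bool{"main": true, "init": true}
	for name, deg := range inDegree {
		if deg == 0 && !entry[name] {
			fmt.Println("island candidate:", name)
		}
	}
}
```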

【Effect & Benefit】

Through code understanding plus abstract modeling, island-function identification accuracy currently reaches 97%. It has been deployed in 57 business code modules, helping businesses clean up 76,000+ lines of island functions.

6.3 Code understanding application scenario: SA

SA (static analysis) is static code scanning: it scans code through lexical analysis, syntax analysis, semantic analysis, and related techniques. Based on the characteristics of the programming language, various risk scenarios can be distilled into general rules for intercepting exceptions. Because SA rules are developed from manual experience and rely on a-posteriori knowledge, problem recall hits a bottleneck. To solve this, Baidu is exploring AI-SA, which uses deep learning to let a model learn code semantics and defect-triggering behavior and proactively identify code risks. Since the AI-SA model limits the token length of the code it can take as input, code fragments that represent the defect must be extracted within that limit and handed to the model.

【Solution】

To keep the code supplied to the AI-SA model within the length limit, control flow and data flow are analyzed through code understanding, and the context fragments strongly related to the target variables along the call chain are selected, ensuring the accuracy of model recall (a crude slicing sketch follows the figure below).

△ Figure: code understanding for AI-SA
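A crude sketch of the budget-driven selection step: keep only the lines that mention the target variable, capped at a line budget standing in for the model's token limit. Real AI-SA slicing follows control and data flow across the call chain; this line filter only illustrates the idea.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sliceForModel returns the lines of src that reference target, capped at
// maxLines, as a stand-in for a flow-aware slice under a token budget.
func sliceForModel(src, target string, maxLines int) []string {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(target) + `\b`)
	var out []string
	for _, line := range strings.Split(src, "\n") {
		if re.MatchString(line) && len(out) < maxLines {
			out = append(out, strings.TrimSpace(line))
		}
	}
	return out
}

func main() {
	src := `func handle(req *Request) {
	buf := make([]byte, 0)
	log.Println("start")
	buf = append(buf, req.Body...)
	process(buf)
}`
	// Keep only the context relevant to the target variable "buf".
	for _, l := range sliceForModel(src, "buf", 8) {
		fmt.Println(l)
	}
}
```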

【Effect & Benefit】

The AI-SA project was built from 0 to 1 and has recalled a total of 2,000+ effective risk issues.

7. Technical considerations for code understanding in the era of large models

Traditional code understanding works on ASTs or IRs at the bottom layer, while the upper layers rely heavily on hand-written rules to mine scenarios. This brings several disadvantages:

1. Adapting to different languages carries a large engineering cost.

2. Language syntax is updated and iterated rapidly, forcing the underlying parsing to change along with it.

3. Application scenarios depend too much on what people can analyze and mine within the scope of their own understanding, which greatly limits the application of code understanding.

4. Rule-based scenarios yield only 0/1 judgments and are very limited in recall; we should find ways to use model prediction for scenario mining.

Based on the above disadvantages and the development of large-model technology, we hope to open up another line of technical exploration: parse only the most basic code fragments and feed them into a large model, which can then give answers from the perspectives of defects, localization, optimization, and so on, and even estimate the size and type of the risk. This can greatly reduce the engineering investment in code understanding and lift the restrictions on scenario mining.

In terms of code understanding capabilities, large models can optimize the storage layer, analysis layer, model layer, etc. of code understanding to provide better code understanding capabilities.

1. Storage layer : introduce a vector database and store the vector results and contextual dependencies produced by the analysis layer in it. Beyond the original function and module relationship networks of code understanding, a vector relationship network computed through semantic conversion and embeddings can be introduced, providing more dimensions of basic data for subsequent code analysis (a toy in-memory vector store is sketched after this list).

2. Analysis layer : with the language understanding capabilities of large models, file blocks and contexts can be segmented and analyzed more sensibly. The knowledge-transfer capabilities of large models can solve the multi-language migration problem of traditional code analysis. Engineers can also use the multi-round dialogue capabilities of large models to fine-tune domain-specific code analysis models at low cost, making them more accurate.

3. Model layer : introduce GPT-4, Wenxin Yiyan (ERNIE Bot), and other large models to identify risks in the code and understand its functionality more accurately.
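A toy sketch of the storage-layer idea from point 1: code fragments are embedded as vectors and retrieved by cosine similarity. The embed function here is a hypothetical stand-in for a real embedding model, and an in-memory slice stands in for a vector database.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type entry struct {
	fragment string
	vec      []float64
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// embed is a hypothetical embedding function; a real system would call an
// embedding model. This toy version just accumulates rune values.
func embed(text string) []float64 {
	v := make([]float64, 8)
	for i, r := range text { // toy hash-based embedding, NOT semantic
		v[i%8] += float64(r)
	}
	return v
}

func main() {
	store := []entry{
		{fragment: "func Add(a, b int) int { return a + b }"},
		{fragment: "func ReadFile(p string) ([]byte, error) { ... }"},
	}
	for i := range store {
		store[i].vec = embed(store[i].fragment)
	}
	// Retrieve the fragment most similar to a natural-language query.
	q := embed("integer addition helper")
	sort.Slice(store, func(i, j int) bool {
		return cosine(store[i].vec, q) > cosine(store[j].vec, q)
	})
	fmt.Println("best match:", store[0].fragment)
}
```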

In terms of application scenarios, large models can help developers and testers learn and understand a piece of code more efficiently, making refactoring and testing easier. The main changes are:

  • Multi-round question answering is used to understand code, reducing the cost for programmers and testers of learning it.

  • Use large models to provide code refactoring, improvement suggestions and test case design, reducing the labor costs of developers and testers.

  • Use large models to document code, reducing developer labor costs. Manually adjusted versions of the documents generated through code understanding can in turn be used to repeatedly train the model, improving the accuracy of its code understanding.

  • Use large models to predict the type and size of code risks; this result can be fed to the build system to determine its behavior, thereby improving build efficiency.

A series of tools that use large models to analyze code has now emerged in the industry. Our survey found the following relatively popular tools:

[Figure: popular LLM-based code analysis tools in the industry]

Surveying these popular large-model analysis tools shows that they all use large models to improve code understanding and context analysis, offering developers and testers simple code completion and test-case design guidance. Perhaps, as large models develop and iterate, developers will be able to complete code development, tuning, and testing through multiple rounds of natural-language interaction, greatly reducing development and testing costs.

----------  END  ----------

Recommended reading [Technical Gas Station] series:

DeeTune: Design and application of Baidu network framework based on eBPF

Introduction to the basic framework of code-level quality technology

Practice based on openfaas hosting script

Baidu Engineers’ Guide to Avoiding Pitfalls in Mobile Development—Swift Language

Baidu Engineers' Guide to Avoiding Pitfalls in Mobile Development - Memory Leak Chapter

Baidu engineers teach you how to play with design patterns (decorator pattern)
