SCA Technology Advanced Series (2): Application of Code Same-Source Detection Technology in Supply Chain Security Governance




As the construction of "Digital China" accelerates, enterprises rely more and more on open source technology in their digital transformation. Pulling in open source components has become the main way most R&D engineers deliver application requirements. The accompanying pain point is that the vast majority of applications now carry open source component risk. Software composition analysis (SCA), which helps manage and reduce that risk, has emerged in response.

Conventional SCA tools inspect referenced third-party open source components by analyzing component versions and dependencies, and thereby identify known component vulnerabilities and licensing risks. Component vulnerabilities, however, originate from security flaws in the code itself. When a project copies only fragments of an open source component's source code, and flawed code comes along with them, version and dependency analysis is not enough and code homology (same-source) detection is needed. Code homology detection performs composition analysis at the level of source code files. It is mainly used for code traceability analysis, known code vulnerability analysis, and malicious code file analysis, and it can accurately identify the referenced open source software and its associated information. This article looks at the technology from three angles: its principles, its core techniques, and its common application scenarios.


Homology detection means analyzing the origin of the components in an application or piece of software. By analysis granularity, it can be divided into file level, function level, and fragment level, in order of increasing precision. Code homology detection is mainly used to find code in the application source that shares the same origin as other code in the project or as open source code, so it is also called code clone detection.

A code clone is the presence of multiple identical or similar source code fragments in a local code base or an open source code base. Reusing cloned code is one way to improve development efficiency, and to some extent it helps software systems get built, but it can also accidentally bring in the security or licensing risks carried by the cloned fragment itself. As software iterates continuously under agile development, cloning keeps expanding the code base, and without good clone management this raises maintenance costs. Software defects also propagate through the system along with the clones, reducing its reliability. Code clone detection therefore supports the core functions expected of SCA: detecting and analyzing the open source components introduced into an application, checking whether those components have known security vulnerabilities, analyzing the open source licenses declared by the application or its components, and tracing source code back to its origin.


Figure 1 General implementation of homology detection


Code cloning in the strict sense means copying an original code fragment or file wholesale: the developer uses the original directly, and the two copies are exactly the same without any modification. From the perspective of practical needs, however, code clones are usually divided into four types:

Type 1, full clone: apart from comments and whitespace, the two code fragments are identical.

Type 2, renamed clone: variable names, type names, literals, and function names are changed, but the logic of the two code fragments is the same.

Type 3, added, deleted, or modified clone: on top of Type 2, some statements are added, deleted, or modified, and the layout of the source code is changed. The two code fragments are similar in content.

Type 4, self-implemented clone: the two code fragments have the same logical function but different concrete implementations, for example with functions or expressions swapped for equivalent ones, while time complexity, inputs, and outputs remain the same.

For detection, Types 1, 2, and 3 are mainly handled through text similarity detection, while Type 4 requires functional similarity detection. The higher the type number, the harder the clone is to detect.


Figure 2 Example display of four types of code cloning
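As a rough illustration of the four types, the sketch below shows one original function followed by one hypothetical variant per clone type. The function names are invented for this example, and the suffixes exist only so that all variants can live in one module (a real Type 1 clone would keep the original name).

```python
# Original fragment.
def sum_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total


# Type 1: identical apart from comments and whitespace.
def sum_positive_type1(values):
    total = 0
    for v in values:  # skip non-positive numbers
        if v > 0:
            total += v

    return total


# Type 2: identifiers renamed, logic unchanged.
def add_up_type2(items):
    acc = 0
    for x in items:
        if x > 0:
            acc += x
    return acc


# Type 3: renamed as in Type 2, plus an added statement.
def add_up_type3(items):
    acc = 0
    if items is None:  # added guard
        return acc
    for x in items:
        if x > 0:
            acc += x
    return acc


# Type 4: same function, different implementation (still linear time).
def add_up_type4(items):
    return sum(x for x in items if x > 0)
```

All five functions return the same result for the same list, which is why Type 4 can only be recognized by comparing behavior rather than text.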


Effective code clone detection rests on two main techniques: code format conversion and similarity determination. Different detection tools differ in their technical principles, but their main workflow is roughly the same and can be summarized as follows:


Figure 3 code homology detection process

  1. Open source/closed source code library: The code library plays two roles. As the knowledge base for clone detection, enough complete open source and closed source projects are collected in advance and processed with specific algorithms into a set of knowledge base feature tables. As the detection target, the code is preprocessed to remove meaningless fragments, converted, and compared using a specific similarity method to obtain the clone detection results.

  2. Preprocessing and conversion: In software supply chain security scenarios, defective code and core code need to be preprocessed: meaningless parts of the source are removed and the code is normalized. The source is then split into fragments so that it can be converted into comparable units. There are many ways to build these comparison units, depending on the detection principle used.

  3. Source code characterization: In this step the source code can be represented as text, or further represented as symbols so that it is easier to store and to compare and verify later. Deeper representations include converting the source code into an abstract syntax tree (AST).

  4. Code similarity comparison: In this step each code fragment is compared with the other code fragments to find clones, and the results are presented as a list of clone pairs. The choice of similarity algorithm is largely determined by the source code representation. For example, if the AST is used as the representation, a tree matching algorithm is the natural choice.

  5. Code clone result integration: This step associates the clones found in the previous steps with the original source code and presents them in a suitable form. The results are delivered to the requesting party and provided to the source code owner for reference, so that the cloned code can be removed or remediated.
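The five steps above can be sketched end to end in a few lines. This is a minimal illustration, not the implementation of any particular tool: the function names, the three-line window, and the use of SHA-256 as the fingerprint are all assumptions made for the example, and the comment stripping is deliberately naive (it only handles `#` comments and would also cut a `#` inside a string).

```python
import hashlib
import re
from collections import defaultdict


def preprocess(source):
    """Step 2: drop '#' comments and blank lines, collapse whitespace."""
    lines = []
    for raw in source.splitlines():
        line = " ".join(re.sub(r"#.*", "", raw).split())
        if line:
            lines.append(line)
    return lines


def fingerprints(lines, window=3):
    """Step 3: hash every run of `window` consecutive normalized lines."""
    result = []
    for i in range(max(len(lines) - window + 1, 0)):
        chunk = "\n".join(lines[i:i + window])
        result.append(hashlib.sha256(chunk.encode("utf-8")).hexdigest())
    return result


def build_knowledge_base(projects):
    """Step 1: index fingerprint -> names of projects that contain it."""
    index = defaultdict(set)
    for name, source in projects.items():
        for fp in fingerprints(preprocess(source)):
            index[fp].add(name)
    return index


def detect(target, index):
    """Steps 4 and 5: share of the target's fingerprints found per project."""
    fps = fingerprints(preprocess(target))
    hits = defaultdict(int)
    for fp in fps:
        for name in index.get(fp, ()):
            hits[name] += 1
    return {name: count / len(fps) for name, count in hits.items()}
```

Calling `detect(target_source, build_knowledge_base({...}))` returns a mapping from project name to the fraction of the target's fragments that match that project, which is the kind of clone list that step 5 would then tie back to files and line numbers.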


1. Text-based code clone detection method

This method works on preprocessed source code (whitespace, comments, and so on removed) and detects clones directly with text similarity algorithms. When the comparison is made on the source text itself, it covers Type 1 full clones and Type 2 renamed clones. Text similarity detection is not limited to comparing raw source, however. Source code carries structure and meaning, and much of that information is lost when it is treated as plain text, so feature fingerprints are often extracted from the source first; comparing fingerprints is still a comparison of two strings. With a suitable algorithm this can also cover Type 3 added, deleted, and modified clones. If a similarity algorithm based on natural language processing were added, Type 4 self-implemented clones could in theory be covered as well, but that step may be slow.
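A minimal sketch of the text-based idea, using Python's standard `difflib` module as a stand-in for whatever text similarity algorithm a real tool uses. The helper names and the `#`-only comment stripping are assumptions made for this example.

```python
import difflib
import re


def normalize(source):
    """Remove '#' comments and blank lines, collapse runs of whitespace."""
    lines = []
    for raw in source.splitlines():
        line = " ".join(re.sub(r"#.*", "", raw).split())
        if line:
            lines.append(line)
    return lines


def text_similarity(a, b):
    """Similarity in [0, 1] between two sources, compared line by line."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b))
    return matcher.ratio()
```

A Type 1 clone scores 1.0 after normalization, and a fragment with one inserted line still scores high, so a threshold below 1.0 would let this approach pick up simple Type 3 clones as well.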

2. Token-based code clone detection method

This method uses a lexical analyzer to split the source code into a token sequence and then looks for similar subsequences. Because it performs lexical analysis, the symbol sequence follows compiler principles more closely and makes fuller use of the information in the source code. It does not analyze syntax or semantics, however, so its results on Type 3 and Type 4 clones are not ideal.
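The token-based approach can be sketched with Python's own `tokenize` module as the lexical analyzer. In this illustration identifiers, numbers, and strings are replaced by placeholders so that renamed (Type 2) clones produce the same token sequence; the helper names and the placeholder scheme are assumptions made for the example.

```python
import io
import keyword
import tokenize

# Layout-only tokens that carry no lexical content for comparison.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def token_sequence(source):
    """Lex Python source into a normalized token sequence."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")      # any identifier
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")     # any numeric literal
        elif tok.type == tokenize.STRING:
            tokens.append("STR")     # any string literal
        else:
            tokens.append(tok.string)  # keywords and operators as written
    return tokens


def longest_common_run(a, b):
    """Length of the longest contiguous token run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

A long shared run relative to the fragment length suggests a clone. Inserted or deleted statements break the run, which is one way to see why this approach struggles with Type 3 clones.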

3. Tree-based code clone detection method

This method represents the source code as an abstract syntax tree or a parse tree and then uses a tree matching algorithm to find similar or identical subtrees. Because it analyzes the syntax of the source code, it makes further use of the information in the code, detects Type 3 clones better, and improves detection accuracy.
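A simplified sketch of the tree-based idea using Python's `ast` module. Instead of a full tree matching algorithm, it normalizes identifiers and constants, serializes every statement subtree with `ast.dump`, and measures the overlap between the two sets of subtrees. The class and function names, and the Jaccard overlap as the score, are assumptions made for this example.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers and constants with fixed placeholders."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="ID", ctx=node.ctx), node)

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = "ID"
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "FN"
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)


def subtree_shapes(source):
    """Set of serialized, normalized statement subtrees in the source."""
    tree = _Normalizer().visit(ast.parse(source))
    return {ast.dump(n) for n in ast.walk(tree) if isinstance(n, ast.stmt)}


def tree_similarity(a, b):
    """Jaccard overlap of the normalized statement subtrees of a and b."""
    sa, sb = subtree_shapes(a), subtree_shapes(b)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A renamed clone scores 1.0, while a clone with an added statement still shares its unchanged statement subtrees with the original and so scores above zero.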

4. Metric-based code clone detection method

This method extracts source code metrics (such as the number of lines of code, the number of variables, and the number of loops), abstracts them into feature vectors, and then identifies clones by the distance between feature vectors. It has a great advantage in speed.
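As a small illustration of the metric-based approach, the sketch below turns a Python fragment into a five-element feature vector and compares vectors by Euclidean distance. The particular metrics chosen here (statements, distinct names, loops, branches, calls) are an assumption for the example; real tools pick their own.

```python
import ast
import math


def metrics(source):
    """Feature vector: (statements, distinct names, loops, branches, calls)."""
    nodes = list(ast.walk(ast.parse(source)))
    names = {n.id for n in nodes if isinstance(n, ast.Name)}
    return (
        sum(isinstance(n, ast.stmt) for n in nodes),
        len(names),
        sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        sum(isinstance(n, ast.If) for n in nodes),
        sum(isinstance(n, ast.Call) for n in nodes),
    )


def metric_distance(a, b):
    """Euclidean distance between the feature vectors of two fragments."""
    return math.dist(metrics(a), metrics(b))
```

Fragments whose distance falls below a chosen threshold are reported as clone candidates. Because only a handful of numbers are compared, the check is fast, but unrelated fragments can share the same vector, so candidates usually need a second, more precise comparison.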

5. Graph-based code clone detection method

This method converts the source code into a program dependence graph (PDG) composed of a data flow graph and a control flow graph, and detects clones by finding isomorphic subgraphs. Graph-based detection uses not only the syntactic structure of the source code but also, to some extent, its semantic information, so it can detect Type 4 clones. However, both generating program dependence graphs and matching isomorphic subgraphs have high time and space complexity, which makes graph-based methods hard to apply to large software systems.


Different code clone detection methods suit software systems of different scales, programming languages, and structures. The following metrics are generally used to evaluate a detection method:

Recall: the ratio of correctly detected code clones to the total number of real code clones.

Recall = TP / (TP + FN)

Detection accuracy (precision): the ratio of correctly detected code clones to all code clones reported by the detection algorithm.

Detection accuracy = TP / (TP + FP)

Where: TP is the set of reported clone fragments that are real code clones (the intersection of the detected and the real clone fragments), FP is the set of reported fragments that are not real code clones, and FN is the set of real code clone fragments that the method failed to detect.
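The two formulas can be computed directly from the set of reported clone pairs and the set of real clone pairs. The function name and the representation of a clone as a pair of file names are assumptions made for this small example.

```python
def evaluate(detected, real):
    """Return (recall, precision) for a set of reported clone pairs."""
    detected, real = set(detected), set(real)
    tp = len(detected & real)   # reported and real
    fp = len(detected - real)   # reported but not real
    fn = len(real - detected)   # real but missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

For instance, a tool that reports three pairs, two of which are among four real clones, has TP = 2, FP = 1, and FN = 2, giving a recall of 0.5 and a detection accuracy of about 0.67.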

Just as OWASP Benchmark serves as a benchmark project for evaluating the baseline detection capability of vulnerability scanning tools, the field of code clone detection has its own benchmark projects for evaluating detection effectiveness. For example:

1. Bellon's benchmark, published in 2007: six different code clone detection tools were run against four C programs and four Java programs, and the results were compared against a body of real code clones to create a code clone data set.

2. BigCloneBench, released in 2015: a collection of 8 million verified code clones in IJaDataset-2.0 (a big data software repository containing 25,000 open source Java systems). It covers the four main clone types, both within and between projects.


Given the pain points and needs of industry and enterprise users, homology detection built on code clone detection can be divided into three categories: code traceability analysis, known code vulnerability analysis, and malicious code file analysis.


In SCA, code traceability analysis examines the target code to trace the third-party open source projects it references, down to the matching files and code lines, and combines this with the licenses those projects declare to determine whether introducing the code creates intellectual property risks such as license compatibility or compliance problems.

The technique extracts code features with methods such as similarity hashing and exact matching, combines and computes them into code fingerprints, and then associates, matches, and analyzes those fingerprints against a code-level big data fingerprint database.
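The article does not say which similarity hash is used. SimHash is one well-known option, sketched here purely as an assumption: each feature (for example a normalized line or token n-gram) votes on every bit of a 64-bit fingerprint, so that inputs sharing most features end up with fingerprints that differ in only a few bits.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 hash bytes used below


def simhash(features):
    """Compute a 64-bit SimHash fingerprint from an iterable of strings."""
    weights = [0] * BITS
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, w in enumerate(weights):
        if w > 0:
            value |= 1 << i
    return value


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two fragments would be treated as same-source candidates when the Hamming distance between their fingerprints is below a chosen threshold, while exact matching corresponds to a distance of zero. The threshold itself is a tuning parameter.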


Third-party open source code introduced during development may contain known vulnerabilities that attackers can exploit. According to one survey, more than 80% of vulnerable files have a same-source file in open source projects, and propagation through open source projects has expanded the reach of vulnerable files 54-fold, as shown in Figure 4.


Figure 4 The proportion distribution of the same source of vulnerable files

Homology detection identifies the open source project that shares its origin with the current code. Combined with vulnerability database information, this makes it possible to determine whether the current code comes from a vulnerable open source project, whether it comes from a vulnerable version of that project, and whether it contains the vulnerability-related code itself.
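The three checks described above can be sketched as a lookup against an advisory table. Everything in this example is hypothetical: the advisory ID, the project name, the version strings, and the fragment fingerprint are placeholders, not entries from any real vulnerability database.

```python
# Hypothetical advisory data, for illustration only.
ADVISORIES = [
    {
        "id": "EXAMPLE-0001",
        "project": "libdemo",
        "affected_versions": {"1.0", "1.1"},
        "fragment_fingerprint": "fp-abc",
    },
]


def assess(project, version, fragment_fingerprints):
    """Check a matched project/version/fragment set against the advisories."""
    findings = []
    for adv in ADVISORIES:
        if adv["project"] != project:
            continue  # check 1: is the source project a vulnerable one?
        findings.append({
            "id": adv["id"],
            # check 2: is the matched version an affected version?
            "version_affected": version in adv["affected_versions"],
            # check 3: is the vulnerable fragment itself present?
            "vulnerable_code_present":
                adv["fragment_fingerprint"] in fragment_fingerprints,
        })
    return findings
```

The third check is the one that distinguishes homology detection from version-based matching: code copied from an affected version is only a real risk if the vulnerable fragment was copied too.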


More and more security incidents are now caused by attackers deliberately submitting malicious code and publishing updates in open source communities, adding malicious dependencies to open source projects, abusing package managers to distribute malware, and other new attack methods. In the npm ecosystem, for example, eslint-scope published a version containing malicious code after attackers stole a developer's account, and event-stream had a malicious dependency added after an attacker became one of the project's maintainers. Detecting and identifying malicious code in source code is therefore a necessary SCA capability.

SCA identifies malicious code by extracting feature data for sensitive behavioral functions from the source code and comparing it with feature data collected in advance from known malicious code.

Homology detection is an important foundation of SCA. Yuanjian SCA has in-depth code homology detection as a core capability. It accurately identifies the third-party open source components referenced during application development, extracts component features along multiple dimensions through application composition analysis, computes component fingerprints, and digs out the security vulnerabilities and open source license risks hidden in components. It covers industry scenarios such as supply chain security review, software compliance review, and third-party component security control.
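A toy sketch of that comparison: the code below collects the dotted names of all function calls in a Python source file with the `ast` module (without executing it) and reports which hypothetical malicious profiles are fully contained in that set. The profile names and their contents are invented for illustration; real feature data would be far richer than a list of call names.

```python
import ast

# Hypothetical feature data, for illustration only.
KNOWN_MALICIOUS_PROFILES = {
    "remote-shell": {"socket.socket", "subprocess.Popen"},
    "dynamic-payload": {"eval", "os.system"},
}


def call_name(node):
    """Return the dotted name of a call target, e.g. 'os.system', or None."""
    parts = []
    target = node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
        return ".".join(reversed(parts))
    return None


def extract_calls(source):
    """Collect the names of all calls in the source without running it."""
    calls = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = call_name(node)
            if name is not None:
                calls.add(name)
    return calls


def match_profiles(source):
    """Names of the profiles whose calls all appear in the source."""
    calls = extract_calls(source)
    return sorted(name for name, profile in KNOWN_MALICIOUS_PROFILES.items()
                  if profile <= calls)
```

A match here is only a signal for review: legitimate code also calls these functions, so a practical detector would combine many more features before raising an alert.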


In applications developed in agile mode, different developers are responsible for different microservices, so the same functionality often ends up implemented more than once without anyone noticing. Identifying this kind of redundant code through code homology detection makes code reuse easier to consolidate and manage, makes later maintenance more convenient, and allows a defect to be fixed in one place when it appears, which further protects code quality.

For example, different developers working on their own applications may each code the same function again. Running similarity detection over the code makes it possible to merge the similar code into an SDK package that is maintained in one place and can be used by other developers.


Figure 5 The picture comes from the SCA tool of Yuanjian


Code homology detection can be used for fragment-level code risk detection, which means identifying code fragments with potential security risks or vulnerabilities in a code base or project so that they can be repaired or hardened. When enterprises build key functions on third-party open source components, they often do not use the components as they are but modify the component source code to fit business needs. Conventional SCA tools have difficulty identifying the risks in such modified components. Code homology detection, by characterizing the defective code fragments themselves, can associate component vulnerabilities with modified open source components and with script code that lacks version information.

For example, when conventional SCA associates component vulnerabilities with open source components packaged as jar files, it relies mainly on the version number and the component fingerprint. Modifying a component destroys part of its fingerprint, which can cause identification to fail. Code homology detection can instead determine whether such a component carries the corresponding vulnerability by looking for the risky code fragments associated with that vulnerability.

Fragment-level risk detection through code homology detection helps find and fix potential security problems quickly and improves code security and reliability.


Open source components are not unrestricted components. Their use must strictly follow the open source license. Even when cloned code is used without the authorization of the open source project's author, it is still bound by the project's license. Code intellectual property infringement means plagiarizing or copying other people's code or algorithms when writing software, thereby infringing their intellectual property rights. Common forms include plagiarizing open source software, copying other people's code, and misappropriating algorithms. For their own commercial software, enterprises need to use code homology detection to review their use of open source component code on a regular basis, especially code in scripting languages, to make sure the software carries no open source license risk.

For example, junior developers may, without enough awareness, use fragments of open source code whose license restricts commercial use, exposing the application to licensing restrictions. Code homology detection can help identify this kind of risk. It can be used to review code for intellectual property infringement, helping developers protect their own intellectual property while avoiding infringement of other people's, and improving the legitimacy and credibility of the code.


Figure 6 The picture comes from the Yuanjian SCA tool


When building a secure development system, secure coding standards are an important part of improving code quality. Code homology detection can be used to review how widely the standard secure code is actually applied, helping ensure that the applications built by the R&D team have basic security and robustness.

For example, after an enterprise has written its standard secure code, it wants developers to use that code as much as possible. Code homology detection can measure the usage coverage of the standard code and support its further adoption.


Under national requirements for independent and controllable enterprise software, the self-development rate of software source code will become an important reference metric, and the core backbone of an application system must strictly follow the principle of being independently controllable. Source code homology detection not only shows whether code fragments may carry security risks, but also helps measure the self-development rate of application source code, raising the acceptance bar when an enterprise takes delivery of outsourced or jointly developed code.
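The article does not define how the self-development rate is calculated. One plausible definition, assumed here for illustration, is the share of source lines that homology detection did not match to any known open source project. The function name and the input format are likewise hypothetical.

```python
def self_developed_rate(files):
    """Share of lines not matched to open source.

    `files` maps a file path to (total_lines, lines_matched_to_open_source).
    """
    total = sum(t for t, _ in files.values())
    matched = sum(m for _, m in files.values())
    if total == 0:
        return 0.0
    return (total - matched) / total
```

For a delivery of two files with 100 and 300 lines, where 100 lines of the second file match open source code, the rate is (400 - 100) / 400 = 0.75. An acceptance standard could then require the rate for core modules to stay above an agreed threshold.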


Figure 7 The picture comes from the Yuanjian SCA tool


With the spread of ChatGPT, developers have gradually begun using it to generate usable code automatically. Because the AI learns from collected samples, however, the generated code may contain code clones and carry the risk of violating licenses. Code homology detection can add a security risk audit step when code is generated.


The main value of homology detection lies in helping organizations identify and inventory open source software components and detect known code vulnerabilities or malicious code. At both the file level and the code fragment level, it helps software asset management address the main pain points of assets that are "unclear" and "unfathomable", and so helps ensure the security and reliability of supply chain components in the enterprise.

Yuanjian SCA combines its own technical strengths, such as a rich knowledge base of samples and a fragment-level detection algorithm with multiple levels of time compression, to optimize homology detection, and offers the following features:

1. Rich knowledge base samples: the knowledge base covers mainstream code hosting platforms such as GitHub, GitLab, BitBucket, Gitee, and Codeberg, with more than 80 million open source projects covered;

2. Fragment-level detection algorithm with multiple levels of time compression: detection over tens of millions of code fragments completes within seconds, close to the time needed for file-level similarity detection;

3. Accurate positioning: the results of the fragment-level algorithm are verified, and the specific open source project address, version, file path, and line numbers can be clearly located;

4. Flexible adaptation: the similarity threshold can be adjusted to reduce the probability of false positives.

On top of source-level homology detection, Yuanjian SCA combines binary SCA, runtime SCA, and component vulnerability hot-fix technology to help developers manage and maintain software components and improve software security and reliability. It helps enterprises establish and effectively implement a digital supply chain security governance system and keep the digital supply chain secure.

Reprinted from丨Xuanjing Security

Editor丨Shao Kejia


Introduction to Kaiyuanshe

Founded in 2014, Kaiyuanshe is made up of individual members who voluntarily contribute to the cause of open source. It is organized according to the principle of "contribution, consensus, and co-governance" and has always remained vendor-neutral, public-interest, and non-profit. It is an open source community federation with the mission of "open source governance, international integration, community development, and project incubation". Kaiyuanshe works closely with communities, enterprises, and government bodies that support open source. With the vision of "Based in China and Contributing to the World", it aims to build a healthy and sustainable open source ecosystem and help China's open source community become an active participant in and contributor to the global open source system.

In 2017, Kaiyuanshe became an organization composed entirely of individual members, operating with reference to the governance model of top international open source foundations such as the ASF. Over the past nine years it has connected tens of thousands of open source practitioners, gathered thousands of community members and volunteers and hundreds of lecturers from China and abroad, and worked with hundreds of sponsors, media, and community partners.


Origin blog.csdn.net/kaiyuanshe/article/details/131407901