[Paper Sharing] How could Neural Networks understand Programs?

foreword

Notes on the ICML 2021 paper "How could Neural Networks understand Programs?"

Understanding program semantics is a fundamental problem in programming language processing (PLP). Recent work on learning code representations with NLP pre-training techniques has pushed the frontier in this direction. However, the semantics of PL and NL are fundamentally different, and the paper argues that ignoring this difference makes it hard to build a model that truly understands programs, whether by directly applying off-the-shelf NLP pre-training techniques to source code or by adding features to the model via heuristics. In fact, program semantics can be strictly defined by formal semantics in PL theory. For example, operational semantics, which describes what it means for a program to be valid, updates the environment (i.e., the function mapping memory addresses to values) through basic operations such as memory I/O and conditional branching. Inspired by this, the paper proposes a new paradigm for program semantics learning, in which the information learned by the model includes: (1) representations consistent with the basic operations in operational semantics; (2) the environment transition information necessary for program understanding. To verify this proposal, the paper presents OSCAR, a hierarchical Transformer-based pre-trained model that better facilitates program understanding. OSCAR learns from intermediate representations (IR) and encodings obtained from static analysis, which represent the basic operations and approximate the environment transitions, respectively. OSCAR demonstrates superior program semantic understanding on many practical software engineering tasks.

In a word: apply a Transformer model to program understanding
Project address: https://github.com/pdlan/OSCAR

introduction

Modern software often contains large amounts of code, functions, and modules, with extremely complex structures and organization. This poses great challenges for writing, maintaining, and analyzing such programs. Fortunately, a series of deep learning-based productivity tools has been developed to help programmers automatically perform tasks such as security auditing and code retrieval by analyzing programs. Inspired by the success of pre-trained representations for natural language semantic understanding, there have been many attempts to transplant traditional NLP pre-training techniques to the source code level: a code representation is obtained by capturing contextual information from a large amount of source code text and is then fine-tuned for various downstream software engineering tasks. For example, CuBERT leverages the powerful pre-trained contextual embedding model BERT to learn informative representations on a Python corpus; CodeBERT learns general representations by pre-training on NL-PL pairs to bridge natural language (NL) and high-level programming languages (PL). In addition, some work adds expert-designed features (such as data flow graphs) to the pre-trained model to provide extra information for program semantic understanding.
However, programming languages and natural languages differ fundamentally in nature. For example, the same program may behave differently depending on its input and memory state, and natural language has no such explicit notion. Current approaches that attempt to capture semantic features directly from source code, whether by applying off-the-shelf NLP pre-training techniques or by adding features to models via heuristics, therefore limit the understanding of program semantics.

Inspired by programming language theory, this paper proposes a code representation learning paradigm that enables models to better understand programs. The code representation should be learned from two aspects: (1) a translation of the source code text that is consistent with the basic operations defined in operational semantics; (2) the environment transition information. To verify the effectiveness of this proposal, the paper builds a pre-trained model based on a hierarchical Transformer, called OSCAR (Operational Semantics for Code Abstract Representation), which aims to capture contextual information across long code sequences. On the one hand, OSCAR uses an intermediate representation (IR) to represent the basic operations. Since the IR is modeled on an abstract machine with a limited instruction set, it maps almost perfectly onto operational semantics and is therefore more suitable than a high-level programming language for learning code representations. In particular, the IR can be translated from either the binary code or the source code of the target program. On the other hand, obtaining concrete and precise environment transition information would require a large amount of actual execution and computation, which is both unrealistic and risky. Therefore, OSCAR instead uses abstract information that can be easily obtained by static program analysis, inspired by abstract interpretation.
Abstract interpretation describes program semantics through a mathematical characterization of a program's possible behaviors, rather than by modeling the behavior of the program over many actual execution traces. Furthermore, to capture the control structure of the target program or code fragment, a new Positional Condition Encoding (PCE) is proposed to encode control flow information into the model.

Contributions
(1) Propose a new learning paradigm, showing that a pre-trained model can learn code representations from the surface instructions and the underlying environment transitions, alleviating the limitations on understanding program semantics from the perspective of operational semantics.
(2) Propose OSCAR to validate the proposed design: a hierarchical Transformer that takes IR for the basic operations and encodings derived from static analysis for the approximate environment transitions. An effective training objective is designed for OSCAR to further facilitate program semantic understanding.
(3) OSCAR significantly improves the performance of program semantic understanding on a wide range of practical downstream software engineering tasks. Furthermore, OSCAR shows remarkable zero-shot capability compared with state-of-the-art pre-training methods, i.e., without fine-tuning its parameters.

method

A new learning paradigm is proposed so that pre-training can learn the surface-level instruction information and, at the same time, the underlying environment transition information.
On a simplified abstract machine, the operational semantics of assignment and sequential composition are expressed as:
$$
\frac{\langle E,\ s\rangle \Downarrow V}{\langle L := E,\ s\rangle \Downarrow s[L \mapsto V]}
\qquad\qquad
\frac{\langle C_1,\ s\rangle \Downarrow s'}{\langle C_1;\ C_2,\ s\rangle \Downarrow \langle C_2,\ s'\rangle}
$$

Here $E$, $L$, and $V$ denote expressions, memory locations, and values, respectively; $s \in S$ denotes an environment function that maps memory locations to values; and $C$ denotes a code fragment.
The rule on the left says: if the premise holds, i.e., the expression $E$ reduces to the value $V$ in environment $s$, then the conclusion holds, and the assignment $L := E$ updates the environment function $s$ with $L \mapsto V$. The rule on the right says: if the premise holds, i.e., executing the code fragment $C_1$ in environment $s$ yields $s'$, then the conclusion holds, and executing $C_1; C_2$ in environment $s$ is equivalent to executing $C_2$ under $s'$.
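To make these two rules concrete, here is a minimal Python sketch (my own illustration, not code from the paper) in which the environment $s$ is simply a dictionary from memory locations to values; `assign` implements the assignment rule and `seq` implements sequential composition.

```python
# Minimal sketch of the two rules above: the environment s is a dict
# mapping memory locations to values (names are illustrative only).

def evaluate(expr, s):
    """Reduce an expression E to a value V in environment s."""
    if isinstance(expr, (int, float)):         # literal value
        return expr
    if isinstance(expr, str):                  # a memory location: look it up
        return s[expr]
    op, lhs, rhs = expr                        # e.g. ("+", "x", 1)
    l, r = evaluate(lhs, s), evaluate(rhs, s)
    return l + r if op == "+" else l - r

def assign(loc, expr, s):
    """<L := E, s>  =>  s[L -> V], where <E, s> reduces to V."""
    v = evaluate(expr, s)
    return {**s, loc: v}                       # updated environment s[L -> V]

def seq(c1, c2, s):
    """<C1; C2, s>  =>  run C1 to get s', then run C2 under s'."""
    s_prime = c1(s)
    return c2(s_prime)

# Example: x := 1; y := x + 2
s0 = {}
s2 = seq(lambda s: assign("x", 1, s),
         lambda s: assign("y", ("+", "x", 2), s),
         s0)
print(s2)   # {'x': 1, 'y': 3}
```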

The semantics of a code fragment thus depends on two parts: the instructions on the abstract machine and the environment transition information. Therefore, this paper proposes to learn good code representations from both parts in order to better understand the semantics of programs. Below, we introduce OSCAR, a hierarchical model that learns code representations from these two aspects.

Existing program understanding methods have widely adopted learning representations directly from high-level PLs. However, with the development of modern programming languages and compilers, the gap between the textual representation of source or binary code and its actual computational meaning has widened. This non-negligible gap increases the difficulty of code understanding for existing models. To better analyze and optimize programs, modern compilers translate source code into IR before generating machine code for the target architecture. The IR is modeled on an abstract machine and is usually designed so that each instruction represents exactly one basic operation. We collect a large corpus of real-world programs and translate them into LLVM IR as pre-training data.
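As a rough, hypothetical illustration of this kind of preprocessing (not the authors' actual pipeline), a C source file can be lowered to textual LLVM IR by invoking clang with `-S -emit-llvm`:

```python
# Sketch: lower a C file to textual LLVM IR with clang (assumes clang is installed).
# This mirrors the general idea of building an IR pre-training corpus, not OSCAR's exact pipeline.
import subprocess
from pathlib import Path

def c_to_llvm_ir(src: str, opt_level: str = "-O2") -> str:
    """Compile `src` (a .c file) to textual LLVM IR and return it as a string."""
    out = Path(src).with_suffix(".ll")
    subprocess.run(
        ["clang", opt_level, "-S", "-emit-llvm", src, "-o", str(out)],
        check=True,
    )
    return out.read_text()

if __name__ == "__main__":
    print(c_to_llvm_ir("example.c")[:500])  # first lines of the generated IR
```

Compiling the same source at different optimization levels (e.g. -O0 vs -O3) will also matter later, when differently optimized IRs of one program are used as semantics-preserving views for contrastive pre-training.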


We leverage structural operational semantics to illustrate how environment transition information is encoded into the model. The inductive nature of structural operational semantics requires a well-defined initial condition described by an initial environment function. To fully capture concrete and precise information about environment transitions, one would have to iterate over many possible combinations of input values and initial conditions and actually execute the program according to the sequential rules to infer the transitions. This is clearly infeasible, since actual execution is time-consuming and risky, for example when analyzing large software projects or malware. Therefore, we instead use abstract environment information obtained from static program analysis in place of concrete environment information. Abstract environment information, inspired by abstract interpretation, describes program semantics through a mathematical characterization of the program's possible behaviors rather than by modeling the program's behavior over many actual execution traces. Applying this idea to structural operational semantics, each expression can be reduced not only to a concrete value but also to a relation or a possible range in the value space.
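To see what "reducing an expression to a possible range" means, here is a toy interval-domain sketch (my own example, not from the paper): the abstract environment maps each variable to a lower/upper bound instead of a concrete value, and branches narrow those bounds.

```python
# Toy interval domain: an abstract environment maps each variable to a (low, high)
# range instead of a concrete value, so expressions reduce to ranges.
Interval = tuple  # (low, high)

def add(a: Interval, b: Interval) -> Interval:
    return (a[0] + b[0], a[1] + b[1])

def refine_true_branch(x: Interval, upper_bound: int) -> Interval:
    """Abstractly account for a branch like `if (x < upper_bound)` on the true path."""
    return (x[0], min(x[1], upper_bound - 1))

abs_env = {"x": (0, 10)}                            # x could be anything in [0, 10]
abs_env["y"] = add(abs_env["x"], (1, 1))            # y = x + 1  ->  y in [1, 11]
abs_env["x"] = refine_true_branch(abs_env["x"], 5)  # inside `if (x < 5)`: x in [0, 4]
print(abs_env)                                      # {'x': (0, 4), 'y': (1, 11)}
```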

Three types of environment relational constraints are extracted from the instructions: constraints governed by static single assignment (SSA), constraints governed by memory reads, and constraints governed by memory writes. This information is readily available through LLVM's built-in analyses, such as MemorySSA.
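The paper obtains these constraints with LLVM's own analyses inside compiler passes; as a rough Python-side approximation (my assumption, not the authors' tooling, and it does not replicate MemorySSA), one can at least enumerate def-use and memory read/write relations from textual IR with llvmlite:

```python
# Sketch: enumerate SSA def-use and memory read/write relations from textual LLVM IR
# with llvmlite. Only a rough approximation of the constraints described above.
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

def extract_relations(ir_text: str):
    relations = []
    module = llvm.parse_assembly(ir_text)
    for func in module.functions:
        for block in func.blocks:
            for instr in block.instructions:
                kind = ("mem_read" if instr.opcode == "load"
                        else "mem_write" if instr.opcode == "store"
                        else "ssa")
                operands = [op.name or str(op).strip() for op in instr.operands]
                relations.append((kind, instr.opcode, operands))
    return relations

# Usage: relations = extract_relations(open("example.ll").read())
```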

The model architecture of OSCAR is a hierarchical multi-layer Transformer encoder, as shown in the figure below. OSCAR consists of a two-level encoder. The lower level consists of two token-level encoders that process the IR and the abstract environment information, respectively; the upper level is an instruction-level encoder designed to further extract features from the outputs of the lower level. Each encoder is implemented in the same way as BERT. We refer to the two token-level encoders as the IR encoder and the Env encoder. During data preparation, the source code is compiled into a binary, retdec is used to decompile it into LLVM IR, and a custom pass is applied to the IR to generate an LLVM abstract environment file, with each IR instruction corresponding to one environment entry.

[Figure: the hierarchical model architecture of OSCAR]

Since the Transformer was developed for sequence transduction problems in natural language, it does not capture well the complex control structures of programming languages, such as iteration and selection logic. However, control flow information is integral to understanding program semantics. To overcome this problem, previous work incorporated the control flow graph (CFG) into the Transformer.
This paper designs a simpler and more effective method, Positional Condition Encoding (PCE), which encodes control flow information into the model through positional encodings. PCE assigns three learnable embedding vectors to the position of each instruction in the target program or code fragment, representing the instruction's current position and the target positions after a conditional jump when the condition is true and false, respectively. Figure 2 shows the PCE scheme and the control flow graph corresponding to a code fragment, where $p_i$, $p_i^1$, and $p_i^0$ denote, for the instruction at position $i$, the learnable embedding of its current position, of its true-jump target position, and of its false-jump target position, respectively.
[Figure 2: PCE and the corresponding control flow graph of a code fragment]

From Figure 2, we can see that PCE incorporates the outgoing-edge information of CFG nodes into the attention module, and the incoming-edge information is also captured once the position correlations in Equation 2 are computed. This shows that PCE lets OSCAR capture all of the information in the CFG, even though the CFG is never explicitly given to the model.
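The exact form of Equation 2 is not reproduced here, but the following PyTorch sketch shows one plausible reading of PCE (hypothetical names and wiring, not the paper's implementation): each instruction position carries embeddings for its own position and for its true/false jump targets, and their correlations are added to the attention logits.

```python
# Sketch of Positional Condition Encoding (PCE): instruction i contributes its own
# position embedding plus embeddings of its true-/false-branch target positions,
# and their dot products are added to the attention scores as a bias.
# This is a plausible reading of the idea, not the paper's exact Equation 2.
import torch
import torch.nn as nn

class PCEAttentionBias(nn.Module):
    def __init__(self, max_instructions: int, dim: int):
        super().__init__()
        self.cur = nn.Embedding(max_instructions, dim)        # p_i
        self.true_tgt = nn.Embedding(max_instructions, dim)   # p_i^1
        self.false_tgt = nn.Embedding(max_instructions, dim)  # p_i^0

    def forward(self, pos, true_pos, false_pos):
        """pos / true_pos / false_pos: (L,) instruction indices -> (L, L) attention bias.
        Non-branch instructions can simply point both targets to the next instruction."""
        q = self.cur(pos)                                                    # query side: p_i
        k = self.cur(pos) + self.true_tgt(true_pos) + self.false_tgt(false_pos)  # key side
        return (q @ k.T) / q.size(-1) ** 0.5

# Usage: bias = PCEAttentionBias(512, 64)(torch.arange(4),
#                                         torch.tensor([1, 3, 2, 0]),
#                                         torch.tensor([1, 2, 3, 0]))
# and add `bias` to the attention logits before the softmax in the instruction-level encoder.
```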

How to effectively capture program- or fragment-level semantic knowledge during pre-training is undoubtedly crucial for a code representation model, but previous work has not studied it well. Modern compilers support various compilation options to meet different optimization needs, such as minimizing execution time, memory footprint, or storage size. A single source file can thus be translated into contrasting IRs using different optimization techniques without changing the meaning of the code, and different combinations of optimizations can serve as a data augmentation method for source code. Inspired by this, the paper proposes a [CLS]-token objective trained with momentum-encoder contrastive learning as a self-supervised task for OSCAR, to better facilitate semantic understanding at the program level.
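A minimal sketch of such a momentum-encoder contrastive objective (MoCo-style InfoNCE on the [CLS] representation; the encoder, queue, and hyperparameters are placeholders, not OSCAR's actual implementation) might look as follows, where the two views are IRs of the same program compiled with different optimization flags:

```python
# Sketch of a [CLS]-level contrastive objective with a momentum encoder, where two
# "views" of one program are its IRs compiled with different options (e.g. -O0 vs -O3).
import torch
import torch.nn.functional as F

def momentum_update(encoder, momentum_encoder, m: float = 0.999):
    """EMA update of the momentum encoder (momentum_encoder starts as a copy of encoder)."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data = m * p_m.data + (1.0 - m) * p.data

def contrastive_loss(encoder, momentum_encoder, queue, view_q, view_k, tau: float = 0.07):
    """view_q / view_k: batches of the same programs compiled with different options;
    queue: a (K, d) tensor of negative keys from previous batches."""
    q = F.normalize(encoder(view_q), dim=-1)                  # [CLS] of the query view
    with torch.no_grad():
        k = F.normalize(momentum_encoder(view_k), dim=-1)     # [CLS] of the key view
    pos = (q * k).sum(-1, keepdim=True)                       # positive logits (B, 1)
    neg = q @ queue.T                                         # negative logits (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)         # the positive is at index 0
    return F.cross_entropy(logits, labels)
```

The design relies on the fact that changing the optimization level changes the surface form of the IR drastically while preserving program semantics, which is exactly the property a semantics-level contrastive objective needs.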

experiment

OSCAR is pretrained on a large corpus of real programs from publicly available open-source GitHub repositories, covering a broad range of disciplines from operating systems and compilers to machine learning systems and linear algebra subroutines.

In this section, we evaluate the performance of OSCAR on several program semantic understanding tasks. We first evaluate the model on a practical and important software engineering task, binary diffing. Then, we evaluate OSCAR's ability to understand high-level PLs on an algorithm classification task. Furthermore, since OSCAR is a pre-training method, its zero-shot learning performance is investigated with its parameters kept fixed. Finally, the components of the model are analyzed in an ablation study.

Bin-diff

As shown in the results, OSCAR far outperforms BinDiff, Asm2Vec, and BinaryAI at all optimization levels for the five programs. For example, in the most difficult matching case, between the O0 and O3 optimization levels, OSCAR improves recall over all baseline techniques on every program.

Algorithm Classification

In this subsection, we investigate the performance of OSCAR on high-level programming language understanding. The experiment is carried out on the POJ-104 dataset [1], which contains 104 algorithm problems.

Compared with all previous methods, the model achieves a large improvement, which shows that OSCAR understands the semantics of source code written in high-level PLs well.

Zero-Shot Learning

We further investigate the performance of pretrained OSCAR in the zero-shot learning setting, i.e., evaluating OSCAR without modifying its parameters.
[Table 3: zero-shot code similarity detection results]

As shown in Table 3, both pretrained OSCAR and OSCAR1-6-1 perform well on code similarity detection compared with other pretrained models, without any further modification of parameters. This suggests that OSCAR can potentially be transferred to downstream tasks without fine-tuning.

Ablation Study

In this subsection, we will use BusyBox to study the impact of each component in OSCAR on the Binary Diffing task.

[Table 4: ablation of OSCAR components on the BusyBox binary diffing task]

Table 4 shows the ablation experiments on the two components of OSCAR, the contrastive loss and PCE. As shown in the table, all components are beneficial and improve the recall of the binary diffing task. Meanwhile, we further train BERT on the IR corpus; this baseline is similar to CuBERT [2] because they share exactly the same architecture, the only difference being that CuBERT is pre-trained on a Python corpus. The experimental results show that CuBERT performs poorly on the IR binary diffing task, which reflects the significant benefit of OSCAR's hierarchical structure.

Summary

Related Works

[1] Mou, L., Li, G., Zhang, L., Wang, T., and Jin, Z. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[2] Kanade, A., Maniatis, P., Balakrishnan, G., and Shi, K. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

Insights

(1) The model itself plays a limited role; more attention should be paid to data processing. Consider using IR or pseudocode (e.g., the output of IDA decompilation).
(2) Generally speaking, if IR or decompiled pseudocode is used, representations closer to the compiled form may give better results when used as training data. However, the current difficulty lies in generating the IR, which urgently needs further study.

Origin blog.csdn.net/qq_33976344/article/details/123703826