[Paper Sharing] Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking

Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking [TSE 2023]

Huaijin Wang, Pingchuan Ma, Yuanyuan Yuan, Zhibo Liu, Shuai Wang (Department of Computer Science and Engineering, HKUST, Clear Water Bay, Kowloon, Hong Kong)
Qiyi Tang, Sen Nie, and Shi Wu (Tencent Security Keen Lab)

Binary code function search is a core foundation of various security and software engineering applications, including malicious sample classification, code clone detection, and vulnerability auditing. Identifying logically similar assembly functions remains a challenge. Many binary code search tools rely on structural program information, such as control flow graphs and data flow graphs, extracted by program analysis techniques or deep neural networks (DNNs). However, the vocabulary-, control-structure-, or data-flow-level information that DNN-based techniques capture for representation learning is usually too coarse to accurately represent program functionality, and it may exhibit low robustness under challenging settings such as compiler optimization and obfuscation. This paper proposes a general solution for improving the top-k ranking candidates returned by DNN-based binary code function search. The key idea is to devise a low-cost and comprehensive equivalence check that quickly exposes functional deviations between the target function and its top-k matches. Functions that fail the equivalence check can be removed from the top-k list, while functions that pass it can be promoted, deliberately improving the top-k candidates. The authors design a practical and efficient equivalence check named BinUSE, based on under-constrained symbolic execution (USE). USE is a variant of symbolic execution that improves scalability by starting symbolic execution directly from the function entry point and relaxing constraints on function parameters, removing the burden of path explosion and costly path-prefix constraints. BinUSE provides assembly-function-level equivalence checking, enhancing DNN-based binary code search by reducing false positives at low cost. Evaluation shows that BinUSE can generally and effectively enhance four state-of-the-art DNN-based binary code search tools when facing the challenges posed by different compilers, optimizations, obfuscations, and architectures.

Bottom line: BinUSE uses underconstrained symbolic execution to perform equivalence checks on binary functions searched by DNNs, reducing false positives to enhance search results.

Introduction

With the vigorous development of machine learning and its wide application in downstream tasks such as software embedding [6], [7], most contemporary binary code search tools aim to train a machine learning model to capture binary code similarity [8], [9], [10], [11]. In particular, recent advances in deep neural networks (DNNs) and representation learning have made it possible to train DNN models to learn code representations that can distinguish similar assembly functions [12], [13], [14], [15], [16], [17]. To learn code representations, DNN models are trained on (lightweight) vocabulary, control-structure, or data-flow-level features. Such representations, although easy to extract, may not preserve program semantics well. These lightweight features are generally not robust to challenges such as compiler optimization or obfuscation, which make semantically similar assembly code look quite different. Therefore, the DNN model may exhibit low discriminability and low robustness, leading to a large number of false positives among the top-k candidates it retrieves.

This paper aims to enhance binary code function search with an effective, principled, and efficient method. Given the target function $f_t$ and a function repository $RP$, a low-cost equivalence check quickly identifies functions in $RP$ that semantically deviate from $f_t$, so they can be removed from the retrieved top-k ranking candidates; functions that pass the check can then be reconsidered for inclusion in the top-k. The main results of enhancing DNN-based binary function search tools are shown in Table 1.

[Table 1]

This work provides effective enhancement for four SOTA DNN tools, even though these tools use different neural network models and learning methods. To design a low-cost and practical assembly-function equivalence check, this paper adopts constraint solving and under-constrained symbolic execution (USE) [19] to construct and verify the input-output relations of assembly functions. Compared with standard symbolic execution, USE performs flexible and fast symbolic reasoning directly from the function entry point, skipping the expensive path prefixes leading to the target function. The standard USE scheme is optimized into a utility, BinUSE, specialized for equivalence checking of assembly functions. BinUSE initiates a USE traversal from the function entry point, following each path until reaching the first external call site, which represents an information-rich and critical node on the CFG. BinUSE then uses the symbolic formulas of the external call site's inputs to form the symbolic constraints of each path, and cross-matches the collected constraints to match two functions.

Contributions
(1) At the conceptual level, a new focus is proposed: enhancing DNN-based binary code function search at low cost. Instead of designing a new DNN (which, in principle, can hardly capture semantics accurately), this paper designs a low-cost equivalence check to flag and remove assembly functions that semantically deviate from the target function.
(2) At the technical level, an equivalence checking method is proposed whose cost is further reduced by optimizing the standard USE scheme. The check is specifically designed for assembly functions, taking into account various technical challenges and optimization opportunities, such as collecting symbolic constraints only at external call sites reachable from the function entry point to reduce complexity.
(3) Experimental results show that the designed equivalence check is general and effective, and can enhance DNN-based binary function search tools at low cost. Under various challenging conditions, including general equivalent-function matching and CVE search, the equivalence check shows excellent performance.

Preliminaries

Formulation and Metrics

In existing semantics-aware function search work, given a target assembly function $f_t$ and a function repository $RP$, the search engine retrieves the k functions with the highest semantic similarity to $f_t$ as the top-k result. Top-k accuracy is computed as:

$$\frac{1}{N} \times \sum_{i=1}^{N} p_k(f_i)$$

where N is the total number of functions in the program. Intuitively, each function $f_i$ in binary $Bin_1$ is iteratively taken as the target function to query the repository $RP$ formed by all functions in $Bin_2$. Let $f_i'$ be the ground-truth match of $f_i$; then $p_k(f_i) = 1$ if $f_i'$ appears among the top-k candidates, and 0 otherwise.
Another commonly used metric is the mean reciprocal rank (MRR):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\operatorname{rank}_i}$$

where $|Q|$ is the total number of queries and $\operatorname{rank}_i$ is the rank of the correct result among the top-k candidates of query i; for example, if the correct result is ranked 4th in the top-k, then $\operatorname{rank}_i = 4$. The larger the MRR, the better.
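The two metrics above can be sketched in a few lines of Python. This is an illustrative toy (not the paper's code), assuming each query function's ground-truth match shares its name in the ranked candidate list:

```python
def topk_accuracy(topk_lists, k):
    """Fraction of queries whose ground-truth match appears in the top-k."""
    hits = sum(1 for query, ranked in topk_lists.items() if query in ranked[:k])
    return hits / len(topk_lists)

def mrr(topk_lists):
    """Mean reciprocal rank: 1/rank of the ground truth, 0 if absent."""
    total = 0.0
    for query, ranked in topk_lists.items():
        if query in ranked:
            total += 1.0 / (ranked.index(query) + 1)  # ranks are 1-based
    return total / len(topk_lists)

# Toy retrieval results: query name -> ranked candidates from the repository.
results = {
    "f1": ["f1", "g", "h"],   # correct at rank 1
    "f2": ["g", "f2", "h"],   # correct at rank 2
    "f3": ["g", "h", "x"],    # ground truth missed entirely
}
print(topk_accuracy(results, 2))  # 2/3: f1 and f2 are in the top-2
print(mrr(results))               # (1 + 1/2 + 0) / 3 = 0.5
```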

Equivalence Checking

In addition to the popular DNN-based representation learning, another research direction is code equivalence checking, which uses program input-output relations obtained through symbolic execution. Given symbolic formulas representing the binary code's input-output relations, a constraint solver checks the equivalence of the formulas. Equivalence checking is resilient to challenging settings such as compiler optimizations and obfuscation, since these settings should not change the program's input-output relations. An example is shown below.
[Figure: two example code fragments]

The two code snippets above can each be reduced to the two statements below them. The equivalence check then asks whether the constraint $a = m \wedge p \neq s$ is satisfiable in the input domain; if no input satisfies this constraint, the two code fragments are strictly equivalent.
Limitations. This technique gives a strict proof of program equivalence. However, symbolic execution and constraint solving scale poorly due to path explosion, reasoning over complex constraints, and binary-analysis-specific challenges. So far, equivalence-checking-based methods have mainly been used for basic-block or execution-trace comparison.
Binary classification. Equivalence checking has two standard errors: false positives (FP) and false negatives (FN). The former treats functionally different functions as equivalent; the latter treats equivalent functions as non-equivalent. Although equivalence checking cannot be used directly to compute top-k accuracy, USE-enabled equivalence checking can be used to eliminate false positives from DNN models.

Under-Constrained Symbolic Execution

Under-constrained symbolic execution was proposed to check arbitrary code fragments, reducing the complexity of symbolic execution both in principle and in practice. To illustrate the high-level technical differences between SE and USE (in terms of path coverage), Fig. 2a presents a message-decoding program with a bug in decoding_msg. The main function receives the encoded message via receive_msg and decodes it in a loop; the decoded message is then passed to decoding_msg, where a bug (labeled at line 15 in Fig. 2a) hides in an if branch. SE may be hindered even in this simple case due to high computational resource usage and lengthy constraints. USE reduces the complexity in a principled manner: to reach the bug, it directly analyzes decoding_msg. The resulting path is shown in Fig. 2c; no complex constraints are imposed on the decoded message $msg$, and easier-to-solve constraints may be introduced. The more expensive whole-program analysis can be postponed until needed.

[Figure 2]

Limitations. By relaxing the constraints on inputs, USE may find satisfying solutions that are not actually valid once the path prefix from main to the target code fragment is considered. In this setting, successfully finding a satisfiable solution means the two code fragments fail the equivalence check. In general, USE provides a complete, efficient, but unsound equivalence check, which may lead to false negatives, and these are often undesirable.

Motivation

DNN-based methods learn code representations from vocabulary, control structures, or data-flow facts. A well-trained DNN model converts input binary samples (or machine instructions) into numerical vectors, where two similar programs should have a closer cosine distance. DNN-based methods mainly learn "fuzzy" and lightweight data and control features, which are highly flexible and scalable and facilitate the analysis of large-scale binary samples. However, the learned vocabulary, control, or data features do not necessarily represent functionality accurately. In summary, the learned embedding representations mainly have two defects: (1) low discriminability: DNN models can treat logically different functions as similar, resulting in more reported FP matches; (2) low robustness: robustness refers to resistance under various imperfect conditions when running software or algorithms; low robustness means the DNN model may have difficulty matching functions that share the same logic but differ syntactically, resulting in more reported FN matches.

[Figure 3]

Figure 3 shows the comparison results, where red dots represent similarity scores between assembly functions compiled from the same source function, and blue dots represent similarity scores between other functions. Ideally, high similarity scores for the red dots show the robustness of the DNN model to compiler optimization: assembly functions compiled from the same source function have the same semantics despite different syntactic forms. Conversely, well-separated blue dots reflect discriminability, meaning that assembly functions compiled from different source functions look very different in the DNN model's view.

[Figure 4]

Figure 4 gives a wrong prediction: two assembly functions compiled from different source code are mistakenly considered "similar". This may be because the "contextual flow graphs" extracted by ncc are similar and cannot reflect the functional deviation.

Approach overview

Modern DNN-based binary code search learns code representations from coarse-grained features, yielding low discriminability and low robustness. Consequently, the top-k matching functions usually have similar similarity scores, while the ground-truth match may not score high enough under challenging settings such as compiler optimization or obfuscation. Our preliminary study manually checked the top-k matching results of these DNN models, and we suspect that a simple equivalence check can effectively reduce FPs. For example, we can feed the target function $f_t$ and the top-k functions the same input and compare their outputs for differences.

[Figure 5]

Figure 5 depicts an overview of the workflow for augmenting the top-k retrieval of DNN-based tools. In short, this study aims to provide a low-cost equivalence check that makes a true/false judgment on whether two assembly functions are identical. In this way, the search results of DNN-based tools can be conditioned by reducing the FPs in the top-k retrieval. For example, many of the "blue dots" in the motivating example (Fig. 3) can be determined to differ from the target function simply by executing the two functions on the same input and comparing their outputs; with a low-cost equivalence check, these FPs can be easily removed from the top-k retrieval. Checking software equivalence is expensive in general, so as a practical trade-off we accept a relatively low-precision equivalence check, in the sense that two functions that pass the check may still differ. Thus, adding a low-cost equivalence check to DNN-based binary code search yields real synergy: faster service and higher precision.

[Figure omitted]

1. Execution-based equivalence checks use randomly sampled values as inputs and compare the outputs of concrete executions. This approach may cover only a small part of the input space and, by treating different functions as equivalent, may lead to high FPs. Moreover, setting up a proper execution context when directly executing assembly code is very challenging.
2. Equivalence checks based on symbolic execution accurately model input constraints and construct the check within the legal input space, so they should be accurate. However, SE scales poorly, and its execution can hardly reach functions hidden deep in call chains.
3. Equivalence checks based on under-constrained symbolic execution can start symbolic reasoning at arbitrary program points, improving on standard SE by skipping expensive path prefixes. Because path prefixes are ignored, USE cannot model the constraints on the target code fragment's inputs and therefore over-explores the full input space. USE performs a complete but unsound equivalence check, because it may find counterexamples outside the legal input space that violate the equivalence constraints.
4. Function-level equivalence checking based on under-constrained symbolic execution. Despite the general difficulty posed by unsoundness, our observation of real-world software leads to a key assumption in this study: functions in real programs follow the defensive programming principle [35], [36], [37], which states that no function should make assumptions about its input (for example, a pointer passed by the caller may be invalid). That is, a function's input can be any value in the input space. This assumption provides a unique opportunity for a "sound" equivalence check specifically at the function level, because when analyzing functions in real software, the legal input space should align with the full input space explored by USE. This principle is adopted by popular program suites (e.g., coreutils, binutils) and followed by programmers of complex real-world software such as OpenSSL and Wireshark.
5. BinUSE-based function equivalence checking trades completeness for speed, further optimizing the standard USE method. BinUSE computes the symbolic formulas of external call-site inputs to form symbolic constraints for each path; to match two functions, it cross-matches the constraints collected on each path. BinUSE is not complete, but empirically its average FN rate is very low. Furthermore, BinUSE can perform low-cost checks under various challenging settings (e.g., cross-architecture): checking two coreutils executables can be done in 56.6 CPU minutes (25 seconds per pair of functions on average), including all symbolic execution and constraint-solving tasks.

BinUSE Design

Given an input executable, BinUSE first performs reverse engineering to recover assembly function information. Then, it launches USE path by path from the entry point of each assembly function, where each path terminates upon reaching the first external call site. This produces a subgraph in which each leaf node corresponds to an external call site. To compare two assembly functions, we cross-check the semantic equivalence of external call-site inputs and path constraints via constraint solving, comparing their derived subgraphs (details below).

[Figure omitted]

The reverse engineering analysis is at the function level and does not assume the presence of program symbols or debugging information. As long as functions can be identified, stripped binaries can be processed without difficulty. The analysis is also platform-neutral; we evaluate three cross-architecture settings: x86 64-bit, x86 32-bit, and ARM. Different compilers (gcc and clang), optimization levels, and common obfuscation methods are also evaluated.

Baseline

The baseline approach performs standard intra-procedural analysis starting at the function's entry point and traverses each execution path. Free symbols are created whenever data is loaded from unknown sources, including function parameters, global data, and other memory regions. Then, at the exit points of each path, we collect the output symbolic formulas for CPU registers and memory to construct input-output relations.

Generating Subgraphs From CFG

Considering the difficulty of fully exploring a function's CFG, we first extract a subgraph that preserves the representative features of the CFG while reasonably reducing analysis complexity. BinUSE traverses each execution path starting from the assembly function's entry point. When a loop is encountered, we unroll it, as is common practice. When analyzing an execution path, BinUSE recursively inlines each call site on the path, and stops at external call sites. Consistent with standard SE, we create free symbols to represent values stored in registers or memory locations. When an external call site is encountered, we collect the symbolic formulas of each function-call input to form the "output" of this path, and record the path constraints as the precondition for reaching the external call site.
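The traversal can be sketched as a bounded walk over a CFG that records one leaf per path. This is a toy of our own with a hand-made CFG dict; a real implementation would drive angr's symbolic execution instead:

```python
def extract_subgraph(cfg, entry, max_visits=2):
    """Walk every path from `entry`, unrolling loops up to `max_visits`,
    stopping at the first external call site; return one leaf per path."""
    leaves = []  # (path constraints, callee, symbolic args)

    def walk(node, constraints, visits):
        if cfg[node]["kind"] == "ext_call":
            leaves.append((tuple(constraints), cfg[node]["callee"], cfg[node]["args"]))
            return
        for cond, succ in cfg[node]["succs"]:
            if visits.get(succ, 0) >= max_visits:  # bounded loop unrolling
                continue
            nv = dict(visits)
            nv[succ] = nv.get(succ, 0) + 1
            walk(succ, constraints + [cond] if cond else constraints, nv)

    walk(entry, [], {entry: 1})
    return leaves

cfg = {
    "entry": {"kind": "block", "succs": [("x > 0", "then"), ("x <= 0", "else")]},
    "then":  {"kind": "ext_call", "callee": "printf", "args": ["fmt", "x"]},
    "else":  {"kind": "ext_call", "callee": "puts", "args": ["msg"]},
}
leaves = extract_subgraph(cfg, "entry")
print(leaves)  # two leaves, one per path, each ending at an external call site
```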

Comparing Two Subgraphs

[Figure omitted]

The figure above shows the process of comparing the subgraphs $G_t, G_s$ of two assembly functions $f_t, f_s$. Each call site is compared iteratively until a pairwise-equivalent permutation is found between the external call sites in $G_t$ and those in $G_s$. Note that a subset of call sites in $G_t$ is allowed to match a subset of call sites in $G_s$: compiler optimizations can sometimes eliminate C library calls, so matching subsets against each other does not prevent $G_t$ from matching a highly optimized $G_s$. Conversely, if no such permutation is found, the two functions are considered non-equivalent. Although pairwise comparison of call sites in $G_t$ and $G_s$ can introduce many permutations, the heavyweight equivalence check is only performed when two call sites point to the same external function.

C library call substitution. When some input parameters of a C library call are constant, the compiler may replace that call with another. The compiler may also occasionally replace a common C library call with a safer version; for example, printf may be replaced with __printf_chk, which detects a stack overflow before computing the result.

Library call-site matching is a critical step in the equivalence check, for which we manually collected the following lists; BinUSE treats the library calls within each list as identical. For example, in addition to comparing two printf call sites, we consider a printf call site and a __printf_chk call site identical. The lists contain all C library replacements found in our test cases, which include the Linux coreutils and binutils suites and a CVE database covering vulnerable functions from complex software such as OpenSSL, Wireshark, bash, and ffmpeg.
[Table: C library call replacement lists]

Computing the quantitative matching score. Suppose $G_t$ has n paths (each ending at an external call site), and p of them have external call sites determined to be semantically equivalent to call sites in $G_s$. The confidence score s that the assembly functions $f_t$ and $f_s$ are equivalent is computed as p/n. This confidence score is later used when calibrating the DNN model's results. When analyzing coreutils executables, 87.9% of cases had a confidence score of 1.0, showing that in most cases all paths in $G_t$ can be matched in $G_s$.
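The p/n score can be sketched as follows. This toy of our own matches leaves by callee name only, with a small alias table standing in for the C library replacement lists; `same_callee` is where the real SMT-based argument check would go:

```python
# Hypothetical alias table modeling the paper's replacement lists.
ALIASES = {"printf": {"printf", "__printf_chk"}, "puts": {"puts", "fputs"}}

def same_callee(a, b):
    """Stand-in for the heavyweight per-call-site equivalence check."""
    return b in ALIASES.get(a, {a}) or a in ALIASES.get(b, {b})

def match_score(leaves_t, leaves_s):
    """p/n: fraction of G_t's paths whose external call site matches in G_s."""
    matched = sum(1 for ct in leaves_t if any(same_callee(ct, cs) for cs in leaves_s))
    return matched / len(leaves_t)

print(match_score(["printf", "puts"], ["__printf_chk", "puts"]))  # 1.0
print(match_score(["printf", "strcpy"], ["puts"]))                # 0.0
```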

Comparison of Two Callsites

Although previous research has shown that IDA Pro's support for function information recovery is imperfect [43], standard C library function information is well maintained in IDA Pro's FLIRT database [44]. Therefore, using IDA Pro (version 7.3) largely guarantees reliable analysis of external call sites. Nevertheless, if the executable contains user-defined (or third-party) library calls, FLIRT cannot handle them; recovering such information requires inferring the number of function parameters (see [43], [45], [46] for recent progress on function prototype recovery). Given a call site with N parameters, we extract N symbolic formulas according to the calling convention of the corresponding architecture.

Call-parameter permutation. To check two call sites, we search for a permutation of function parameters that matches pairwise. Instead of simply comparing the i-th parameter of one call site with the i-th parameter of the other, we adopt a more conservative design and compare matching permutations between parameters. For example, in the figure below, $filename_t$ can match either $filename_s$ or $mode_s$, and $mode_t$ can likewise match either $filename_s$ or $mode_s$. This design makes the equivalence check more conservative and robust to compiler optimization and potential obfuscation, but may introduce FPs.

[Figure omitted]

In addition, functions with different numbers of parameters may also be equivalent; for example, gettext and dgettext are equivalent when the latter's domainname argument is a constant.
[Figure omitted]

To address this hindrance, BinUSE is designed to be robust across a wide variety of real-world software. Empirical observations show that the permutation approach does not incur excessive overhead, since almost all commonly used software has a limited number of function parameters. To further justify this design decision, Table 2 shows the distribution of parameter counts over all external call sites encountered in the test cases: most external call sites have at most 2 parameters, and almost all have at most 5. Therefore, the parameter-permutation design does not add significant cost, while making BinUSE's overall design more conservative and robust.
[Table 2]

Parameter equivalence check. Let $PC_t, PC_s$ be the path constraints from the entry points of the two assembly functions to their respective call sites, and let $Arg^i_t, Arg^j_s$ be the i-th parameter of function t and the j-th parameter of function s. Formally, we check:

$$PC_t(X) \wedge PC_s(Y) \wedge Arg^i_t(X) \neq Arg^j_s(Y)$$

where $X = [x_0, x_1, ..., x_m]$ is the list of symbols used by the parameters and path constraints of function t, and $Y = [y_0, y_1, ..., y_n]$ is the list of symbols used by the parameters and path constraints of function s. We check whether there exists a permutation $\pi(Y)$ matching X under which the above constraint is unsatisfiable; if such a permutation exists, $Arg^i_t$ and $Arg^j_s$ are considered equivalent.

This design makes BinUSE more robust and reliable (fewer FNs) even in the face of cross-architecture, cross-compiler, and obfuscation settings. Although it may increase FPs in practice, this is an acceptable trade-off on the FP/FN ratio. In addition to permutation, this step also normalizes constants representing memory addresses in the symbolic formulas. We use a direct method to identify such constants: first, while executing USE path by path, whenever a constant is observed being used as the base address of a code pointer, we mark it as a memory address; second, following an assumption shared by many advanced static disassemblers [47], [48], [49], a constant is treated as a memory address if it points into a data or text section of the ELF executable.
When checking the above constraint, we set the timeout of the Z3 solver [50] to N seconds. If the SMT solver returns unsat, or fails to find a sat solution within N seconds, the two call arguments are considered equivalent. In the current implementation, N is set to 15 seconds. Setting a timeout may cause unequal parameters to be considered equivalent (i.e., FPs); however, it does not introduce FNs, and it speeds up the analysis of large binary samples.

Optimization

While the traversal strategy presented in this section covers most practical cases, some corner cases may hinder the analysis. In particular, an execution path may contain no external call site. In this case, we do not simply skip the path; instead, we collect the path constraints PC from the function entry point until reaching the return instruction (ret) at the end of the path. However, if no path condition can be constructed while traversing the path, we skip comparing it. In general, each return instruction is treated as a special "external call site" with no parameters. To determine whether two paths without external call sites match, we check their path constraints $PC_t$ and $PC_s$ with the following constraint.

[Equation omitted]

This optimization extends the subgraph generated during symbolic traversal with additional nodes representing the path constraints collected from execution paths without external call sites. Then, to compare the two subgraphs $G_t$ and $G_s$ derived from assembly functions $f_t$ and $f_s$, we cross-compare both the nodes representing external call sites and the nodes representing paths without external call sites.

[Table 3]

Table 3 compares the top-1 accuracies of the four evaluated DNN models after enhancement. With the optimization proposed in this section, BinUSE further improves the accuracy of the DNN models by about 2.45% on average.

Boosting DNN-Based Tools

In simple terms, a function $f_s$ is considered similar to the target function when BinUSE and the DNN-based model reach a consensus: $f_s \in P_{dnn} \wedge f_s \in P_{use}$. Algorithm 1 gives the enhancement procedure. P is the final prediction result; its elements are 2-tuples whose first component is the BinUSE confidence score and second component is the DNN confidence score, sorted in descending order with the BinUSE score weighted highest. Finally, the top-k function search results are returned.

[Algorithm 1]

The threshold $\alpha$ is selected experimentally: if $\alpha$ is too high, BinUSE is over-trusted, leading to more FPs; if it is too low, the enhancement is insufficient. The best results are obtained at 0.41, so $\alpha = 0.41$ is chosen.
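The re-ranking step can be sketched as follows. This is an illustrative toy (scores are made up, not from the paper's evaluation): keep candidates whose BinUSE confidence clears $\alpha = 0.41$, then sort by (BinUSE score, DNN score) descending:

```python
ALPHA = 0.41  # threshold from the paper's experiments

def rerank(dnn_topk, binuse_scores, k):
    """dnn_topk: list of (func, dnn_score); binuse_scores: func -> p/n score.
    Keep only candidates on which the DNN and BinUSE agree, then sort with the
    BinUSE confidence weighted highest."""
    kept = [(binuse_scores.get(f, 0.0), d, f) for f, d in dnn_topk
            if binuse_scores.get(f, 0.0) >= ALPHA]
    kept.sort(reverse=True)
    return [f for _, _, f in kept[:k]]

dnn = [("f_a", 0.95), ("f_b", 0.93), ("f_c", 0.90)]   # hypothetical DNN top-k
use = {"f_a": 0.2, "f_b": 1.0, "f_c": 0.6}            # hypothetical p/n scores
print(rerank(dnn, use, 2))  # ['f_b', 'f_c']: f_a is pruned as a likely FP
```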
[Figures omitted]

Experiments

Implementation

BinUSE is implemented on top of the binary analysis framework angr [51] in about 5,500 lines of Python. By connecting with the popular angr ecosystem and lifting assembly code to the platform-neutral VEX intermediate language, BinUSE can handle executables across architectures. Moreover, angr already provides a rich set of analyses (e.g., symbolic execution), saving the effort of building them from scratch.

Evaluation Setup

Datasets and compilation settings. BinUSE was evaluated on the Linux coreutils dataset, which contains 106 programs compiled under seven different settings (see Table 5). The programs were compiled with gcc 7.5.0 and clang 4.0.1, at no optimization (-O0) and maximum optimization (-O3). To facilitate cross-architecture comparison, binaries were compiled for three architectures: 32-bit x86, 64-bit x86, and ARM. A coreutils executable compiled with -O3 has on average 103.7 functions; in other words, given a pair of coreutils executables, BinUSE needs to cross-compare 103.7 × 103.7 assembly functions. In addition, BinUSE was benchmarked on the Linux binutils dataset, which contains 112 programs; each binutils executable has on average 1,765.0 functions.

[Table 5 omitted]

We use BinUSE to enhance four cutting-edge DNN-based binary code function search tools: BinaryAI [15], asm2vec [14], PalmTree [18], and ncc [13].
A note on the selection of training datasets: we train these DNN models on plain (unobfuscated) binary code, and the trained models are then evaluated for robustness under cross-compiler, cross-optimization, cross-architecture, and obfuscation settings.

BinUSE Performance

Table 7 reports the performance of BinUSE under a total of 12 comparison settings on the coreutils dataset. Most comparisons involve challenging cross-compiler, cross-optimization, and cross-architecture settings. For example, the last comparison in Table 7 is a particularly difficult setup: it is cross-architecture (ARM vs. 64-bit x86), cross-compiler (gcc vs. clang), and cross-optimization (-O0 vs. -O3), and it additionally applies control-flow flattening obfuscation (-fla), which extensively changes the control-flow structure.

[Table 7 omitted]

Table 7 shows that BinUSE fails to check more of the functions compiled with control-flow flattening obfuscation (-fla). This obfuscation converts the CFG into the equivalent of a C switch statement and stitches basic blocks together with dispatcher nodes. Code pointers are frequently used in the dispatcher nodes to direct control transfers, which leads to a higher chance of failure when concretizing symbolic code pointers.
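To illustrate what control-flow flattening does (a hand-written analogue, not actual OLLVM output), the flattened version below routes every control transfer through a dispatcher over a state variable:

```python
def gcd_plain(a, b):
    while b:
        a, b = b, a % b
    return a

def gcd_flattened(a, b):
    # Hand-written analogue of control-flow flattening: every basic block
    # becomes one case of a dispatcher loop, and control transfers go
    # through the `state` variable instead of direct branches. Real
    # obfuscators additionally route these transfers through code pointers
    # in the dispatcher, which is what makes symbolic code pointers hard
    # for BinUSE to concretize.
    state = 0
    while True:
        if state == 0:          # loop header: test b
            state = 1 if b else 2
        elif state == 1:        # loop body
            a, b = b, a % b
            state = 0
        else:                   # exit block
            return a

print(gcd_plain(48, 18), gcd_flattened(48, 18))  # 6 6
```

Both functions compute the same result, but the flattened one has a completely different control-flow structure, which is why structure-based matching degrades under -fla.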

Processing Time

Figure 10 breaks down the processing time when comparing gcc -O0 against gcc -O3 binaries. Figures 10a and 10b report the time spent on symbolic execution and on constraint solving, respectively; both are roughly linear in the size of the executable. This is intuitive: larger executables contain more functions, prolonging BinUSE's symbolic execution, and they may also contain more complex symbolic constraints, prolonging constraint solving. Nevertheless, most binary samples finish symbolic execution within 2,000 CPU seconds and constraint solving within 4,000 CPU seconds.

[Figure 10 omitted]

DNN Model Comparison

We ran the four DNN-based binary function search tools, BinaryAI, asm2vec, ncc, and PalmTree, under the 12 comparison settings. PalmTree cannot handle executables on the ARM platform, so the corresponding computations were skipped. Table 8 summarizes the results. BinaryAI outperforms all other models in every setting. Although cross-architecture settings pose significant challenges, BinaryAI appears more robust to cross-architecture changes because it learns over platform-neutral microcode. Obfuscation, especially control-flow flattening (-fla), consistently and severely undermines top-1 accuracy. We encountered many reverse-engineering bugs when lifting binary code to LLVM IR as input for ncc: the binary lifter RetDec throws exceptions on certain binaries. In such cases, top-1 accuracy is measured only over the successfully processed binaries (for binaries compiled with clang -O3, about 40% of the cases remain). We were unable to reproduce the high accuracy reported in the asm2vec paper; we note that both the software engineering and security communities have reported similar problems. Our evaluation shows a top-1 accuracy of 38.3% for asm2vec, which, although lower than the accuracy reported in its paper, is highly consistent with the results of recent studies [16], [68], [69].
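For reference, the top-1 (or top-k) accuracy reported throughout simply measures how often a function's true match appears among the first k returned candidates; a minimal sketch with assumed data structures:

```python
def top_k_accuracy(rankings, truth, k=1):
    """rankings: {query_function: ordered list of candidate matches};
    truth: {query_function: its single true match}.
    Returns the fraction of queries whose true match is in the top k."""
    hits = sum(1 for q, cands in rankings.items() if truth[q] in cands[:k])
    return hits / len(rankings)

# Toy example: three query functions searched against a repository.
rankings = {"f1": ["g1", "g3"], "f2": ["g9", "g2"], "f3": ["g3", "g7"]}
truth = {"f1": "g1", "f2": "g2", "f3": "g3"}
print(top_k_accuracy(rankings, truth, k=1))  # f1 and f3 hit at top-1 -> 0.666...
print(top_k_accuracy(rankings, truth, k=2))  # all three hit at top-2 -> 1.0
```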

[Table 8 omitted]

DNN Model Enhancement

To measure the enhancement that BinUSE brings to DNN-based methods, we aim to answer two questions: 1) RQ1: can BinUSE enhance different DNN-based binary code function search tools? and 2) RQ2: can BinUSE enhance BinaryAI under different settings? For RQ2 we target BinaryAI because it significantly outperforms the other three models. Furthermore, beyond enhancing DNN-based methods, we also explore RQ3: is BinUSE general enough to enhance conventional binary diffing tools based on program structure-level information? For RQ3, we test FuncSimSearch [70], a popular binary diffing tool developed and maintained by Google's Project Zero.


RQ1. Table 9 presents the evaluation results under different settings; we consistently measure top-1, top-3, and top-5 enhancements. Table 9 shows that BinUSE significantly improves all DNN-based methods. DNN models usually learn from coarse-grained representations that are not resilient to the various challenging settings, and therefore generate very high false positive rates; BinUSE addresses this key limitation in a consistent manner.

[Table 9 omitted]

RQ2. To answer RQ2, we investigate three key representation-learning hyperparameters related to embedding vector dimensions. In general, the embedding dimension mainly affects model accuracy: longer embeddings can convey subtler information about the input data, while smaller embeddings may fail to represent the semantics well. However, longer vectors also make model training harder and may hurt the model's robustness. BinaryAI has three embedding-dimension-related hyperparameters: the token embedding dimension, the constant embedding dimension, and the graph embedding dimension. Table 10 shows that, for all hyperparameters and across different embedding dimensions, BinUSE consistently improves accuracy. Overall, this evaluation yields an intuitive observation: despite changes in the model settings, the equivalence check is always able to resolve the high false positive rate, demonstrating another important aspect of BinUSE's generality.

[Table 10 omitted]

RQ3. This study mainly focuses on DNN-based binary code function search, because DNN-based methods exhibit high accuracy and largely outperform traditional algorithms based on program structure, such as graph isomorphism [71]. However, BinUSE is clearly not limited to enhancing DNN-based methods. In principle, we argue that binary diffing based on program structure often produces predictions with low discriminability and low robustness. This argument is empirically validated with an advanced binary diffing tool, FuncSimSearch, which computes SimHash scores over control flow graphs to efficiently determine the distance between assembly functions. The evaluation results are shown in Table 11. Compared with contemporary DNN-based methods, FuncSimSearch performs much worse, and BinUSE can therefore improve accuracy to a large extent under all evaluated settings. The relatively low accuracy of FuncSimSearch is also pointed out in the asm2vec paper. Overall, the evaluations consistently show that BinUSE can reduce the high false positive rates produced by popular structure-based and DNN-based tools in a general, efficient, and unique way. We therefore advocate combining binary code search with BinUSE to achieve a synergistic effect in production use.

Vulnerable Function Searching

We conduct a case study by applying BinUSE to a vulnerability search task on a public vulnerability dataset. This application simulates a common security scenario: given an assembly function f from a suspicious executable, we search a database D of functions with known vulnerabilities and determine whether f matches any function in D. In this step, we compile the sample programs in D into Dasm, a database of assembly functions, under four obfuscation settings: -sub, -bcf, -fla, and -hybrid, where the hybrid setting (-hybrid) combines all three obfuscation methods during compilation. We also enable full optimization (-O3) when compiling each sample program and target function. In short, given a highly optimized (-O3) assembly function with a known vulnerability, we retrieve its matching functions from Dasm and check whether its correct match, the function with the same vulnerability, appears among the top-1 candidates.

In this step, we measure asm2vec, BinaryAI, and two versions of PalmTree. We omit ncc because the binary lifter it employs fails in too many cases when handling this real-world, complex software. Tables 12, 13, 14, and 15 report the evaluation results for each setting. asm2vec appears to struggle on OpenSSL and Wireshark because both programs are very complex: for the three versions of OpenSSL and Wireshark, asm2vec ranks the true matches much lower. For example, asm2vec ranks the true match of the Heartbleed vulnerability in OpenSSL so low that a user may need to manually compare at least 17 functions in Dasm to confirm the Heartbleed vulnerability in a suspicious input. In contrast, BinUSE successfully matches the suspicious input to the Heartbleed vulnerability in Dasm at top-1. asm2vec is also much less accurate when analyzing another notorious CVE, ws-snmp: we found that the vulnerable function contains a large CFG, which may hinder asm2vec's random-walk-based graph-level embedding computation. In summary, with the help of BinUSE, asm2vec can place the truly vulnerable function at top-1.

[Tables 12–15 omitted]

In short, the evaluations in this section reveal very encouraging results when using BinUSE to analyze real-world applications for security purposes. They also demonstrate the necessity of considering fine-grained call-site information when performing function matching.

Extension of BinUSE

Comparing f_t with every function f ∈ RP is still expensive. In this section, we investigate a possible extension of BinUSE whose goal is to reduce this cost by comparing only the top-k functions returned by the DNN model. For example, with k = 100, once the DNN model determines the top 100 functions matching the target function f_t, BinUSE compares only these 100 ranked functions with f_t, re-orders them, and adjusts their rankings. In this way, the number of BinUSE comparisons drops from the size of RP to only 100, reducing the cost of analyzing the binutils programs. However, since BinUSE only accesses and re-orders the first 100 assembly functions ranked by the DNN model, the enhanced top-k accuracy (where k ≤ 100) is bounded by the top-100 accuracy of the DNN model. In other words, if the target DNN model is inaccurate even at top-100, the opportunity for enhancement is also very small.
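This two-stage extension can be sketched as follows; `dnn_score` and `binuse_score` are hypothetical stand-ins for the real models' similarity scores:

```python
def two_stage_search(target, repo, dnn_score, binuse_score, k=100):
    """Stage 1: rank the whole repository by the cheap DNN score and keep
    only the top-k shortlist. Stage 2: re-rank just those k candidates with
    the costlier BinUSE score. The final accuracy is capped by the DNN's
    top-k recall: a true match outside the DNN top-k can never recover."""
    shortlist = sorted(repo, key=lambda f: dnn_score(target, f), reverse=True)[:k]
    return sorted(shortlist, key=lambda f: binuse_score(target, f), reverse=True)

# Toy scores over a 5-function repository with k=3: the DNN shortlist drops
# the two lowest-scoring functions before BinUSE ever sees them.
repo = ["g1", "g2", "g3", "g4", "g5"]
dnn = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.2, "g5": 0.1}
use = {"g1": 0.3, "g2": 0.9, "g3": 0.6, "g4": 1.0, "g5": 0.0}
result = two_stage_search("f", repo, lambda t, f: dnn[f], lambda t, f: use[f], k=3)
print(result)  # ['g2', 'g3', 'g1'] -- g4 was pruned despite its high BinUSE score
```

The pruned g4 illustrates the limitation discussed above: a true match that the DNN ranks outside its top-k is unreachable no matter how well BinUSE scores it.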


Table 17 reports the enhancements to BinaryAI and PalmTree under the 12 comparison settings. BinUSE enhances PalmTree to a higher degree than BinaryAI, mainly because PalmTree's accuracy on the binutils test cases is relatively low, leaving more room for enhancement. On the other hand, compared with the evaluation on the coreutils dataset, the enhancement from BinUSE is lower here. Besides the general difficulty of analyzing binutils functions, in this evaluation BinUSE only analyzes the first 100 functions returned by the DNN-based tools, and according to our observations some true matches do not even appear within those 100 functions. To explore a higher degree of accuracy enhancement, users may consider leveraging the top-150 or top-200 functions returned by the DNN-based tools.

[Table 17 omitted]

Summary

References

[6] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “Code2Vec: Learning distributed representations of code,” in Proc. ACM Program. Lang., 2019, pp. 1–29.
[7] U. Alon, S. Brody, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” 2018, arXiv:1808.01400.
[8] F. Zuo, X. Li, P. Young, L. Luo, Q. Zeng, and Z. Zhang, “Neural machine translation inspired binary code similarity comparison beyond function pairs,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2019.
[9] S. Eschweiler, K. Yakdan, and E. Gerhards-Padilla, “discovRE: Efficient cross-architecture identification of bugs in binary code,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2016.
[10] J. Gao, X. Yang, Y. Fu, Y. Jiang, and J. Sun, “VulSeeker: A semantic learning based vulnerability seeker for cross-platform binary,” in Proc. 33rd ACM/IEEE Int. Conf. Automated Softw. Eng., 2018, pp. 896–899.
[11] S. Luan, D. Yang, K. Sen, and S. Chandra, “Aroma: Code recommendation via structural code search,” 2018. [Online]. Available: http://arxiv.org/abs/1812.01158
[12] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 363–376.
[13] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, “Neural code comprehension: A learnable representation of code semantics,” in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 3589–3601.
[14] S. H. H. Ding, B. C. M. Fung, and P. Charland, “Asm2Vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 472–489.
[15] Z. Yu, R. Cao, Q. Tang, S. Nie, J. Huang, and S. Wu, “Order matters: Semantic-aware neural networks for binary code similarity detection,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 1145–1152.
[16] Y. Duan, X. Li, J. Wang, and H. Yin, “DEEPBINDIFF: Learning program-wide code representations for binary diffing,” in Proc. 27th Annu. Netw. Distrib. Syst. Secur. Symp., 2020.
[17] B. Liu et al., “αDiff: Cross-version binary code similarity detection with DNN,” in Proc. 33rd IEEE/ACM Int. Conf. Automated Softw. Eng., 2018, pp. 667–678.
[18] X. Li, Q. Yu, and H. Yin, “PalmTree: Learning an assembly language model for instruction embedding,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2021, pp. 3236–3251.
[35] C. A. R. Hoare, “How did software get so reliable without proof?,” in Proc. Int. Symp. Formal Methods Eur., 1996, pp. 1–17.
[36] E. Gunnerson, “Defensive programming,” in A Programmer’s Introduction to C#, New York, NY, USA: Apress, 2001.
[37] M. Stueben, “Defensive programming,” in Good Habits for Great Coding, New York, NY, USA: Apress, 2018.
[43] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley, “ByteWeight: Learning to recognize functions in binary code,” in Proc. 23rd USENIX Secur. Symp., 2014, pp. 845–860.
[44] Fast Library Identification and Recognition Technology, 2021. [Online]. Available: https://www.hex-rays.com/products/ida/tech/flirt/
[45] E. C. R. Shin, D. Song, and R. Moazzezi, “Recognizing functions in binaries with neural networks,” in Proc. 24th USENIX Conf. Secur. Symp., 2015, pp. 611–626.
[46] Y. Lin and D. Gao, “When function signature recovery meets compiler optimization,” in Proc. IEEE Symp. Secur. Privacy, 2021, pp. 36–52.
[51] Y. Shoshitaishvili et al., “SOK: (State of) the art of war: Offensive techniques in binary analysis,” in Proc. IEEE Symp. Secur. Privacy, 2016, pp. 138–157.
[68] Y. Hu, H. Wang, Y. Zhang, B. Li, and D. Gu, “A semantics-based hybrid approach on binary code similarity comparison,” IEEE Trans. Softw. Eng., vol. 47, no. 6, pp. 1241–1258, Jun. 2021.
[69] J. Jiang et al., “Similarity of binaries across optimization levels and obfuscation,” in Proc. Eur. Symp. Res. Comput. Secur., 2020, pp. 295–315.
[70] FunctionSimSearch, 2021. [Online]. Available: https://github.com/googleprojectzero/functionsimsearch
[71] H. Flake, “Structural comparison of executable objects,” in Proc. Int. GI Workshop Detection Intrusions Malware Vulnerability Assessment, 2004, pp. 161–174.

Insights

Authors
(1) Training DNN models on binaries built under different optimization options and obfuscations can improve their robustness (similar to the "unleashing power" paper).
(2) The reproduced accuracy of ncc is much lower than reported in its paper.
(3) Fine-grained call-site information should be considered when performing function matching.
(4) Recent advances in explainable AI (XAI) can identify the code most influential to a DNN model's decision. XAI techniques could thus mark the influential code fragments c1 and c2 chiefly responsible for the DNN model's decision to match assembly functions f1 and f2; BinUSE could then be launched on the marked c1 and c2 to check their semantic equivalence. Restricting the check to influential fragments would reduce its cost, but the challenge is how to delimit the fragments' boundaries and maintain their path constraints.

Mine
(1) For single-architecture work, cross-architecture experiments can be dropped; for example, PalmTree only experiments on x86.
(2) Samples where RetDec lifting fails can be discarded, keeping only successfully lifted samples for the experiments.
(3) Could a self-attention mechanism identify high-impact code fragments and automatically adjust the computation weights for code embeddings?

Origin blog.csdn.net/qq_33976344/article/details/129052358