[Paper Sharing] VulHawk: Cross-architecture Vulnerability Detection with Entropy-based Binary Code Search

VulHawk: Cross-architecture Vulnerability Detection with Entropy-based Binary Code Search [NDSS 2023]

Zhenhao Luo, Pengfei Wang, Baosheng Wang, Yong Tang, Wei Xie, Xu Zhou, Danjun Liu and Kai Lu
College of Computer, National University of Defense Technology

Code reuse is very common in software development. However, the vulnerabilities it introduces also spread widely, threatening software security. Unfortunately, as the Internet of Things (IoT) grows and is deployed, the dangers of code reuse are magnified. Binary code search is a viable way to discover these hidden vulnerabilities. Faced with IoT firmware images compiled with different architectures, different compilers, and different optimization levels, existing methods struggle to adapt to such complex scenarios. In this paper, we propose a new intermediate representation function model, an architecture-agnostic model for cross-architecture binary code search. It lifts binary code to microcode and preserves the main semantics of binary functions by supplementing implicit operands and removing redundant instructions. We then generate function embeddings using natural language processing techniques and graph convolutional networks. We refer to the combination of compiler, architecture, and optimization level as the file environment, and adopt a divide-and-conquer strategy to divide the $C_N^2$ cross-file-environment similarity calculation problem into $N-1$ embedding transfer sub-problems. We propose an entropy-based adapter to transfer function embeddings from different file environments into the same file environment to alleviate the differences caused by different file environments. To precisely identify vulnerable functions, we propose a progressive search strategy that supplements function embeddings with fine-grained features to reduce false positives caused by patched functions. We implemented a prototype named VulHawk and conducted experiments under seven different tasks to evaluate its performance and robustness. Experiments show that VulHawk outperforms Asm2Vec, Asteria, BinDiff, GMN, PalmTree, SAFE, and Trex.

In a word, VulHawk uses an entropy-based adapter to transfer function embeddings from different file environments into the same environment, and a progressive search strategy with fine-grained features to refine the embeddings, thereby realizing cross-architecture binary vulnerability function search.

Introduction

Motivation

Code reuse is very common in software development. However, a large amount of code and many libraries are reused in binaries of multiple architectures without security audits, which leaves many hidden vulnerabilities in software projects. Synopsys audited 2,409 projects in 2021 and reported that 97% of projects contained third-party code, and 81% of them contained known vulnerabilities [49]. A single vulnerability in open-source code can spread to thousands of pieces of software, exposing millions of people to serious software security threats. Unfortunately, as the Internet of Things (IoT) grows and is deployed, the dangers of code reuse are magnified. IoT devices are widely used in various scenarios. To meet different usage requirements, these IoT firmware images are generated for different instruction set architectures (ISAs) by different compilers with different optimization levels. However, many IoT firmware images only provide binaries without source code for security analysis, and their symbolic information, such as function names, is usually stripped. Hence, binary code search has become an active research focus for finding vulnerabilities hidden in IoT devices.
Binary code search is used to find similar or homologous binary functions in large function repositories. It is widely used in vulnerability detection [5]-[7], [16], [29], [41], [43], [48], [58]. For example, given a binary, binary code search compares its functions with all functions in a vulnerability library and finds vulnerable functions in the binary based on function similarity. It is also used for malware analysis and binary patch analysis. Since IoT firmware images come from different compilers, optimization levels, and instruction sets, this poses a serious challenge to binary code search and requires highly robust search methods.

Finding vulnerabilities in IoT firmware requires robust binary code search methods across ISAs. For single-architecture binary code search, Asm2Vec [10], DeepBinDiff [11], and PalmTree [27] achieved encouraging results using natural language processing (NLP) techniques. However, they can only search binaries on the same ISA and do not support cross-architecture tasks. InnerEye [60] treats binaries from different ISAs as different natural languages and uses neural machine translation to calculate the similarity of binary code. SAFE [35] uses binaries from multiple ISAs to train its language model to search binary code across architectures. These methods rely heavily on training data, and it is difficult to cover multiple ISAs. Lifting architecture-specific binary code to an architecture-independent intermediate representation (IR) is an effective way to address cross-architecture challenges in IoT firmware. However, natural language and IR are fundamentally different. Unlike natural language, IR includes EFLAGS as implicit operands (e.g., ZF). These flags control the execution path of a function and have important implications for function semantics. In addition, lifting introduces a large number of redundant instructions that dilute the main semantics and reduce the accuracy of semantic extraction.

We consider 3 architectures (x86, ARM, and MIPS), 2 word sizes (32 and 64 bits), 2 compilers (Clang and GCC), and 6 optimization levels (O0, O1, O2, O3, Os, Ofast), a total of 72 combinations (3 × 2 × 2 × 6). If two binaries are chosen from any two of the above combinations, there are $C_{72}^2 = 2556$ pairwise scenarios. Existing methods [9], [10], [27], [41] pin their hopes on deep learning to alleviate these differences and build a robust model for these scenarios. It is possible to build a robust model for one or a few specific scenarios, but building a robust model for all 2,556 scenarios is complicated. Moreover, there is no information directly indicating the compiler and optimization level in a binary.

To address the above issues, this paper proposes a novel cross-architecture binary code search method, VulHawk. It incorporates a new intermediate representation function model (IRFM) to generate robust function embeddings. In IRFM, we first lift binary code to microcode. We then treat microcode sequences as a language and use a variant of the RoBERTa model [31] to build basic block embeddings. We employ graph convolutional networks (GCNs) to integrate basic block embeddings and control flow graphs (CFGs) to generate function embeddings.
For the cross-architecture challenge in P1, microcode is an architecture-independent language that allows our model to train on one ISA and search for functions across multiple ISAs. For the redundant instructions and implicit operands in P1, we implement instruction simplification in IRFM. We treat assignments of implicit operands (EFLAGS) as real assignment instructions, which helps IRFM supplement implicit operand semantics. For redundant instructions, instruction simplification simplifies the microcode based on def-use relations, which not only removes redundant instructions but also retains the main semantics of binary functions. This helps IRFM extract function semantics more precisely. We also propose Root Operand Prediction (ROP) and Adjacent Block Prediction (ABP) pre-training tasks to help the model understand the relationships between operands and the data flow relationships between basic blocks.
For the challenge in P2, we adopt a divide-and-conquer strategy and divide the similarity calculation problem over 2,556 scenarios into 71 embedding transfer problems. We refer to the combination of compiler, architecture, and optimization level as a file environment. Facing 72 file environments, we choose an intermediate file environment and transfer function embeddings from different file environments into that same file environment to alleviate differences. First, we introduce Shannon entropy [47] from the perspective of information theory to represent the amount of information in a binary file. In practice, we find that binaries from the same file environment have similar entropy distributions, so we use entropy streams to identify the file environment. Knowing the file environment of functions, we deploy an entropy-based adapter to transfer their embeddings into the intermediate file environment to alleviate the differences caused by different environments. On top of this, we propose a progressive search strategy to search candidate functions while keeping retrieval efficiency and precision high. First, it uses function embeddings to retrieve the top-K candidate functions based on Euclidean distance. Then, a similarity calibration method supplements the function embeddings with fine-grained features to reduce false positives.

Contributions

(1) We propose an IRFM to generate robust function embeddings across architectures. It lifts binary code to microcode and preserves the main semantics of binary functions through instruction simplification. Two pre-training tasks are proposed to help our model learn the root semantics of operands and grasp data flow relations between blocks. We use GCNs to integrate basic block embeddings along the CFG to generate function embeddings.
(2) Following a divide-and-conquer strategy, we use entropy streams from an information-theoretic perspective to identify the file environment of a binary file. We propose an entropy-based adapter to transfer function embeddings into the same file environment to alleviate differences caused by different file environments.
(3) We propose a progressive search strategy that uses fine-grained features for similarity calibration to improve performance and reduce false positives caused by patched functions.
(4) We implemented VulHawk and evaluated it in three different scenarios: one-to-one comparison, one-to-many search, and many-to-many matching, across compilers, optimization levels, and architectures. Experiments show that VulHawk outperforms state-of-the-art methods.
(5) To facilitate follow-up research, VulHawk's code and pre-trained models are released at https://github.com/RazorMegrez/VulHawk

Background

Binary Similarity Analysis

Cross-architecture binary code search aims to retrieve semantically similar candidate functions for a large number of binary functions extracted from various IoT devices [55]. Inspired by existing work [10], [55], [57], we define two binary functions as semantically similar if they are compiled from the same or logically similar source code.
Like binary code similarity detection, the core of binary code search is to design a robust model to detect whether given functions are similar. Instead of one-to-one matching, binary code search considers one-to-many search, which requires methods that can retrieve semantically similar candidates faster and more accurately. In the real world, IoT firmware can be compiled by various compilers (e.g., GCC and Clang) with different optimization levels (e.g., -O3, -Os), which results in compiled binary functions with the same semantics but different structures. Therefore, an effective cross-architecture binary code search needs to achieve the following goals:
(1) support cross-architecture
(2) support cross-compiler
(3) support cross-compilation options
(4) high precision and high efficiency

Entropy Theory

Shannon's entropy [47] is defined as $H(S) = -\sum_{x \in S} p(x) \log_2 p(x)$, where $S$ is the set of elements and $p(x)$ is the probability of occurrence of element $x$. Figure 1 gives an example of Shannon entropy calculation, in which each pattern represents a different element.

[Figure 1: Shannon entropy calculation for three example systems]

Box 1 contains only circles, so its entropy H is 0; box 2 contains circles and triangles and is more complex than box 1, so its entropy is larger; box 3 is the most complex of the three systems, so its entropy H is the highest.

Through entropy analysis, we can have a prior understanding of the average amount of information in a system before delving into it. In the binary code search task, we obtain the information distribution of binary files through binary file entropy, so as to infer information such as their compiler and optimization level. This helps our model choose appropriate parameters for different binary inputs.

Approach

Overview

VulHawk consists of three components: an intermediate representation function model, an entropy-based adapter, and a progressive search strategy. Figure 2 shows the overall framework.

[Figure 2: Overview of the VulHawk framework]
IRFM is used to generate basic block embeddings and function embeddings. We first promote the binary code to microcode. Then, instruction simplification supplements the implicit operand semantics and prunes redundant instructions, which not only preserves the main semantics of the function, but also improves the robustness of IRFM. Afterwards, we use a language model based on RoBERTa [31] to build basic block embeddings. In model training, we propose root operand prediction and adjacent block prediction pre-training tasks to let IRFM understand the relationship between operands and the data flow relationship between basic blocks. Finally, GCNs are employed to aggregate adjacent basic block embeddings to capture control flow relations and generate function embeddings.

The entropy-based adapter recognizes the file environment of the input binary and uses a divide-and-conquer strategy to handle function embeddings from different file environments. Here, we introduce entropy from the perspective of information theory. We first use entropy to predict the file environment, and then use an entropy-based adapter to transfer function embeddings into the intermediate file environment according to their file environment, mitigating the differences caused by different file environments.

A progressive search strategy is used to precisely detect candidate functions for a query function. We propose a two-step strategy consisting of coarse-grained search and similarity calibration. Through similarity calibration, we filter out false positives (such as patched functions) when detecting vulnerable functions, making our model more accurate.

Intermediate Representation Function Model

Intermediate Representation: For binaries from various architectures, we disassemble them and lift the binary code to an architecture-independent IR. We use IDA Pro and its IR, named microcode, for our implementation, but other disassemblers and IRs would also work. As shown in Table I, microcode groups the various instructions from different architectures into 73 opcodes and 16 operand types. For example, mop_z means no operand, mop_r means register, and mop_str means string constant. Microcode is a mature IR that mitigates the impact of instruction type differences on cross-architecture binary code search. For more details, refer to https://www.hex-rays.com/products/decompiler/manual/sdk/hexrays_8hpp_source.shtml

[Table I: Microcode opcode and operand types]

Tokenization
In microcode, an instruction consists of an opcode and an operand triplet; the operand triplet includes the left operand, right operand, and destination operand.
The base addresses and offsets of different binaries differ, which introduces noise and makes the model less robust. We normalize these addresses (e.g., 0x4040E0 and 0x4150D0) with a special token [addr]. To alleviate out-of-vocabulary (OOV) tokens, the 16 root operand tokens shown in Table I are used as fallbacks. For operands in the vocabulary, we use their own tokens, and for OOV operands, we use their root operand tokens to represent their basic semantics. In the pre-training stage, tokens with a frequency of less than 100 are replaced with their root operand tokens to build root operand token embeddings.
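To make the normalization concrete, below is a minimal sketch of how an operand token could be normalized before being fed to the language model. The helper name, the regular expression, and the `ROOT_OF` mapping are illustrative assumptions rather than VulHawk's actual code; only the [addr] token and the root operand token names (e.g., mop_n, mop_a) come from the text above.

```python
import re

# Hypothetical mapping from an operand's kind to its microcode root-operand token
# (names follow Table I, e.g. mop_n for number constants, mop_a for addresses).
ROOT_OF = {
    "number": "mop_n",
    "address": "mop_a",
    "register": "mop_r",
    "string": "mop_str",
}

ADDR_RE = re.compile(r"^0x[0-9a-fA-F]{4,}$")   # e.g. 0x4040E0, 0x4150D0

def normalize_token(token: str, kind: str, vocab: set) -> str:
    """Normalize one operand token before feeding it to the language model."""
    if ADDR_RE.match(token):           # absolute addresses differ across binaries
        return "[addr]"                # -> replaced by a special token
    if token in vocab:                 # in-vocabulary operands keep their own token
        return token
    return ROOT_OF.get(kind, "[unk]")  # OOV operands fall back to their root-operand token

# Example: an in-vocabulary register is kept, an OOV immediate keeps only its root semantics
print(normalize_token("r0", "register", vocab={"r0", "#0"}))      # -> r0
print(normalize_token("#1337", "number", vocab={"r0", "#0"}))     # -> mop_n
```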

Token Type Layer
Unlike natural language, microcode consists of opcodes and operands, not just single words. The opcode indicates the operation to be performed (for example, ldx and goto), and the operand indicates the data or memory location used by the operation. Considering these differences, we use a token type layer to help IRFM distinguish between opcodes and operands. We classify tokens into three types: opcode, operand, and other. The others type contains some special tokens (such as [pad]) that have no actual semantics.

Instruction Simplification
This paper proposes an instruction simplification method based on def-use relations to prune redundant instructions and preserve important semantics. First, we mark the following "important" instructions to prevent them from being deleted: (1) Global and local variables are stored in memory rather than registers, so we mark assignment instructions whose destination operand is a memory address, such as Ln. 18 in Figure 3(a). (2) The return value is usually stored in a specific register, such as rax (x86) or x0-x1 (ARM), so we mark the corresponding registers, according to the calling convention, near the return instructions on all paths, such as Ln. 19 in Figure 3(a). (3) The arguments of a sub-function appear before the function call and are not overwritten by other instructions before being passed to the sub-function, such as Ln. 2 in Figure 3(a). We use a loose rule for "important" instructions to ensure that no major semantics are removed by mistake. We treat instructions whose defined registers or eflags are not used by subsequent instructions as unused instructions (e.g., Ln. 5, 12-15 in Figure 3(a)) and prune them. After pruning unused instructions, redundant instructions (e.g., Ln. 5-8 in Figure 3(a)) are optimized away.

[Figure 3: Instruction simplification example]

We also focus on instructions that directly assign one register to another variable, called pass-by-value instructions. Through instruction simplification, the 20 instructions in Figure 3(a) are simplified to 9 instructions, which retain the main semantics of Figure 3(a) and closely match its pseudocode. This helps IRFM extract more precise function semantics. In practice, the RoBERTa model accepts a limited input length, and instruction simplification allows the input to RoBERTa to retain more meaningful instructions.
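The following is a minimal, single-basic-block sketch of the def-use-based pruning idea described above. The `Insn` representation, the register/flag names, and the example block are toy assumptions; the real implementation works on IDA microcode and tracks uses across all paths of a function.

```python
from dataclasses import dataclass, field

@dataclass
class Insn:
    """Toy microcode instruction: which registers/flags it defines and uses."""
    text: str
    defs: set = field(default_factory=set)   # registers/eflags written
    uses: set = field(default_factory=set)   # registers/eflags read
    important: bool = False                  # memory stores, return values, call arguments

def prune_unused(block: list[Insn]) -> list[Insn]:
    """Remove instructions whose definitions are never used later (single-block sketch)."""
    live = set()   # values still needed by later instructions
    kept = []
    for insn in reversed(block):
        if insn.important or (insn.defs & live) or not insn.defs:
            kept.append(insn)
            live -= insn.defs        # this instruction satisfies those definitions
            live |= insn.uses        # its own operands must stay live
        # else: dead definition -> pruned
    return list(reversed(kept))

# Example: the flag set by 'setz' is never consumed afterwards, so it is pruned
block = [
    Insn("mov r0, [addr]",  defs={"r0"}, uses=set()),
    Insn("setz zf, r0",     defs={"zf"}, uses={"r0"}),
    Insn("stx r0, [addr]",  defs=set(),  uses={"r0"}, important=True),
]
for insn in prune_unused(block):
    print(insn.text)
```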

Pre-training tasks: In the training phase, pre-training is performed using the Masked Language Model (MLM), Root Operand Prediction (ROP), and Adjacent Block Prediction (ABP) tasks.
The MLM task is used to model the relationships between microcode tokens and construct appropriate word embeddings. MLM was first proposed in BERT; it uses the contextual tokens around masked tokens to predict the masked tokens and thereby optimize model parameters. Figure 4 shows an example, where yellow, red, and green boxes denote masked tokens, replaced tokens, and prediction results, respectively. In Figure 4, the opcode setz and the register r0 are masked as [mask], and the immediate numeric constant #0 is replaced by its root operand token mop_n.

[Figure 4: Masked language model example]

During training, we feed the final hidden states corresponding to masked/replaced tokens into an output softmax over the vocabulary to predict probabilities for these tokens. The loss function of MLM uses cross entropy loss as follows:

[Equation: MLM cross-entropy loss]
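The loss equation above did not survive extraction. For reference, the standard masked-language-model cross-entropy objective that the text describes takes the following form (the notation is ours and may differ from the paper's):

$$\mathcal{L}_{MLM} = -\sum_{i \in M} \log P\!\left(t_i \mid \tilde{T}\right),$$

where $M$ is the set of masked/replaced positions, $t_i$ is the original token at position $i$, and $\tilde{T}$ is the corrupted input sequence.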

This paper proposes the ROP pre-training task to associate token semantics with their root token semantics, so that the model can produce reliable root semantics for out-of-vocabulary tokens. In microcode, operands are divided into 16 types (see Table I), which we use as root operand tokens. This is friendly to OOV operands, because we can convert an OOV operand to its root operand token to represent its root semantics. For example, for a specific address 0xdeadbeef, assuming it is an OOV operand, our model assigns it the semantics of the root token mop_a representing an address operand, while conventional models cannot distinguish its semantics.
Since an opcode's root token is the opcode token itself, the ROP task does not predict root opcode tokens. We use a ROP head to predict the root tokens of operands. During the training phase, we feed the final hidden state of each token into a linear transformation and apply a softmax over operand types to predict the probabilities of these root operand tokens. The loss function of ROP adopts cross-entropy loss as follows:

[Equation: ROP cross-entropy loss]

In binary functions, there are data flow relationships between basic blocks. Unlike natural language, variables in binary code must be defined before they can be used. Basic blocks with data flow relationships are therefore order-sensitive, and IRFM cannot directly capture these relationships. To train a model that understands the data flow relationships between adjacent blocks, we propose the ABP pre-training task. Specifically, given two basic blocks A and B, where B is a successor of A, a variable x is defined in block A, and x is used in block B, we label the order A-B as positive and the order B-A as negative. Note that A and B are not the same block, and A is not a successor of B. Furthermore, we do not consider the case where A and B have only a control flow relationship but no data flow relationship, because without a data flow dependency, the reverse block order could also legitimately occur.
We feed the final hidden state of the [cls] token in IRFM into the ABP head and perform a linear transformation to identify whether the two input microcode sequences are in positive order. ABP's loss function uses cross-entropy loss as follows:

[Equation: ABP cross-entropy loss]

The total loss function of the language model is a combination of the above three loss functions:
[Equation: total pre-training loss]
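The combined objective image is also missing; assuming the three losses are simply summed (an unweighted combination is our assumption), it would read:

$$\mathcal{L} = \mathcal{L}_{MLM} + \mathcal{L}_{ROP} + \mathcal{L}_{ABP}$$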

The task of IRFM in VulHawk is to generate function embeddings. First, basic block embeddings are generated. For an input microcode block, the IRFM transformer encoder outputs a sequence of hidden states, and we apply an average pooling layer to integrate the microcode instruction embeddings. According to the pre-training results, the hidden states of the last layer are too close to the pre-training objectives (e.g., MLM) and may be biased toward those tasks, while the hidden states of the second layer have stronger generalization ability. Therefore, we apply mean pooling to the hidden states of the second layer to generate basic block embeddings.
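As a sketch of this pooling step, the snippet below uses the HuggingFace transformers API to mean-pool an early encoder layer over non-padding tokens. The checkpoint name is hypothetical, and the hidden-state index chosen for the "second layer" is our assumption.

```python
import torch
from transformers import RobertaModel, RobertaTokenizerFast

# Hypothetical checkpoint name -- the released VulHawk model may be organized differently.
tokenizer = RobertaTokenizerFast.from_pretrained("irfm-microcode")
model = RobertaModel.from_pretrained("irfm-microcode")

def block_embedding(microcode: str) -> torch.Tensor:
    """Mean-pool an early encoder layer's hidden states over real (non-pad) tokens."""
    enc = tokenizer(microcode, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # hidden_states[0] is the embedding layer; index 2 is taken here to stand for the
    # "second layer" mentioned in the text (this indexing is our assumption).
    hidden = out.hidden_states[2]                        # (1, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, dim)
```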
Existing studies [34], [55] have shown that CFG-based solutions have advantages in cross-architecture scenarios. This paper integrates basic block embeddings and CFGs to generate function embeddings. Considering the multi-branch structure of binary functions, we use GCNs [24] to capture the CFG structure and propagate basic block semantics to adjacent basic blocks. We treat binary functions as attributed graphs whose basic blocks are the nodes and whose block embeddings are the node attributes. We feed the attributed control flow graphs (ACFGs) into GCN layers. $X^{(l)}$ denotes the node features at layer $l$, and the aggregation function is as follows:

[Equation: GCN aggregation function]
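The aggregation image is missing. Since the text cites the standard GCN of [24], its layer-wise propagation rule is shown below for reference (the paper's exact variant may differ):

$$X^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(l)} W^{(l)}\right),$$

where $\hat{A} = A + I$ is the CFG adjacency matrix with self-loops, $\hat{D}$ is its degree matrix, $W^{(l)}$ are trainable weights, and $\sigma$ is a non-linear activation.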

Given two binary functions, we generate ground truth y, i.e. dissimilar (0) and similar (1), based on the function name and source file. We use the Euclidean distance to calculate the similarity s of two functions as follows:

[Equation: Euclidean-distance-based similarity]
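The similarity equation image is missing. One common way to map the Euclidean distance between two function embeddings $e_1$ and $e_2$ into a $[0,1]$ similarity, consistent with the training objective below, is shown here; the paper's exact mapping may differ:

$$s = \frac{1}{1 + \lVert e_1 - e_2 \rVert_2}$$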

The training goal is to make the similarity of similar functions approach 1, and the similarity of dissimilar functions approach 0. We use cross-entropy loss as the loss function:
[Equation: function similarity cross-entropy loss]

Entropy-based Adapter

In the real world, binary functions are compiled by multiple compilers from different architectures with different optimization levels. In this article, we refer to the combination of compiler, architecture, and optimization level as a file environment. Functions from different file environments, even from the same source code, may have different instructions and structures.
Figure 5 shows an example of matching similar functions in the embedding space. Given an embedding space, points of the same color represent similar functions and their variants. Existing methods do not distinguish the file environment of a function and build one model over the mixed file environments to generate binary function embeddings. Embeddings of different functions are well separated within the same file environment, but mixed file environments may cause embedding conflicts, which greatly increases the complexity of binary code similarity search. Moreover, the gaps between file environments differ; for example, the difference between O0 and O3 optimization is not the same as the difference between the GCC and Clang compilers. Constructing a single model that is highly robust to all file environments is a difficult task.

[Figure 5: Matching similar functions in the embedding space]

To address this challenge, this paper proposes a new divide-and-conquer strategy. First, the embedding space of the mixed file environments is divided into multiple embedding subspaces. Second, one file environment V is selected as the intermediate file environment, and the function similarity problem over $C_N^2$ scenarios across N file environments is divided into N−1 sub-problems of function embedding transfer. Finally, trained adapters transfer the function embeddings of different file environments into the same file environment V for similarity calculation, alleviating the differences caused by different file environments.

Entropy-Based Binary Analysis: An important step in the divide-and-conquer strategy is to determine the file environment. The architecture and word size of a binary function can be determined from its instructions. The problem, however, is that there is no direct indication of the compiler and optimization level in a binary. To solve this, this paper understands binaries from the perspective of information theory and introduces entropy to identify the compiler and optimization level of a binary. In general, compressed or encrypted code segments tend to have higher entropy than native code [33]; entropy can likewise be used to differentiate between different compilers and optimizations.
Figure 6 shows the entropy streams of 12 different binaries from 3 file environments. The entropy stream of a binary file is calculated from its raw bytes, interpreted as their hexadecimal values (0x00-0xFF). It can be observed that entropy streams from the same file environment look similar, while entropy streams belonging to different file environments differ. Using entropy streams and entropy theory, we can identify different compilers and optimizations.

[Figure 6: Entropy streams of 12 binaries from 3 file environments]

To prevent possible collisions caused by a single entropy stream, we use the following features (a sketch of how they could be computed follows the list):

  • The entropy stream of the text segment contains the 256 raw-byte probabilities (0x00-0xFF).
  • The text segment entropy is the integral of the entropy stream over the text segment. This focuses on the executable portion of the binary, avoiding the impact of changes in data segments.
  • The file entropy is the integral of the entropy stream over the entire file, providing global information at the file level.
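Below is a sketch of how these three features could be computed from raw bytes. How VulHawk actually windows and orders the entropy stream is not specified in this summary, so the per-byte probability vector plus the two scalar entropies used here are an assumption; the file paths in the usage comment are hypothetical.

```python
import math
from collections import Counter

def byte_distribution(data: bytes) -> list[float]:
    """256-bin probability distribution of raw byte values (0x00-0xFF)."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def shannon_entropy(dist: list[float]) -> float:
    """Shannon entropy H = -sum p(x) * log2 p(x) over a byte distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def entropy_features(text_bytes: bytes, file_bytes: bytes) -> list[float]:
    """Concatenate the three features described above into one vector (sketch)."""
    stream = byte_distribution(text_bytes)                          # 256-dim entropy stream
    text_entropy = shannon_entropy(stream)                          # text segment entropy
    file_entropy = shannon_entropy(byte_distribution(file_bytes))   # file-level entropy
    return stream + [text_entropy, file_entropy]

# Usage (hypothetical paths): extract the .text segment with a tool such as LIEF or objcopy first.
# features = entropy_features(open("libcrypto_text.bin", "rb").read(),
#                             open("libcrypto.so.1.1", "rb").read())
```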

Figure 7 shows the structure of the basic residual block. It consists of batch normalization and linear transformations, with ReLU as the activation function. A skip connection using the identity map from the input to the output of the basic residual block preserves the function semantics and helps mitigate the vanishing gradient problem.

[Figure 7: Structure of the basic residual block]
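A minimal PyTorch sketch of the basic residual block as described (batch normalization, linear transformation, ReLU, and an identity skip connection) is given below. The embedding dimension and the exact ordering of the layers are assumptions.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block: BatchNorm + Linear with ReLU, plus an identity skip connection."""
    def __init__(self, dim: int = 128):   # embedding dimension is an assumption
        super().__init__()
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc1 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.fc1(self.act(self.bn1(x)))
        out = self.fc2(self.act(self.bn2(out)))
        return out + x   # identity skip connection preserves the function semantics
```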

To calculate the similarity of binary functions from different file environments, this paper adds an entropy-based adapter layer after IRFM. The adapter acts as a mapping F that transfers function embeddings from different file environments into the same intermediate file environment V to alleviate the differences caused by different file environments. The mapping F should both preserve function semantics and reduce the bias caused by different file environments.
To reduce training complexity, we freeze the parameters of IRFM. We take function similarity labels as ground truth, i.e., dissimilar (0) and similar (1). The training goal is to make the similarity of similar functions approach 1 and the similarity of dissimilar functions approach 0. We use cross-entropy loss as the loss function:

[Equation: adapter training cross-entropy loss]

Progressive Search Strategy

Existing methods use function embeddings to search for similar functions. This is a coarse-grained detection approach that lacks fine-grained information (e.g., block-level features); it achieves low search overhead but leads to high false positives, especially for functions with small patch-induced changes. In contrast, Marcelli et al. [34] used graph matching networks, and [58], [60] used Siamese networks, to compute the similarity of each function pair at a fine-grained level. Although this achieves higher accuracy, the computational cost is also higher.
Facing complex vulnerability detection scenarios, this paper proposes a new search strategy, the progressive search strategy, which reduces the computational burden while maintaining good performance and reduces false positives caused by patched functions in vulnerability detection. The strategy combines two sub-strategies. First, function embeddings are used as global summaries for coarse-grained search. Second, pairwise similarity calibration on the candidate functions supplements the function embeddings with fine-grained information, which ensures high precision in vulnerability detection.
For high-precision binary code search, this paper proposes a similarity calibration method for fine-grained detection. It combines basic block, string constant, and imported function information to calculate pairwise similarity scores, extracts vectors from them, and combines them with function-level information to improve vulnerability detection performance.

Block-Level Features: Function-level embeddings may lose block-level features, such as the block embedding distribution and function size. In many cases, the differences between functions exist in smaller substructures that are difficult to reflect in function embeddings. As an analogy, in graph matching, the performance of graph-embedding-based matching can be enhanced with fine-grained node-level information.
String Features: Since string constants and imported functions are the same or similar in similar function pairs, their similarity also helps express the similarity of functions.
Imported Function Features: We use the Jaccard index to calculate the similarity $s_i$ of the two imported function sets $I_1$ and $I_2$.
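For completeness, the Jaccard index of the two sets follows its standard definition:

$$s_i = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$$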

The algorithm as a whole is as follows

[Algorithm: progressive search strategy]

After calculating the above three vectors, we concatenate them with the similarity $s$ from the embedding-based search into a vector $V$. Then, the vector $V$ is fed into a feed-forward network to learn weights and predict the final function similarity $s'$. We use the cross-entropy loss function to optimize the network weights. Finally, we use a default threshold $h$ to filter similar functions as the result.
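To tie the two steps together, here is a minimal sketch of the progressive search: coarse top-K retrieval by Euclidean distance over function embeddings, followed by similarity calibration that concatenates the coarse similarity with fine-grained features and scores the pair with a small feed-forward network. All class names, dimensions, and the exact feature set fed into the calibrator are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def top_k_candidates(query: np.ndarray, repo: np.ndarray, k: int = 50) -> np.ndarray:
    """Coarse-grained search: indices of the k nearest repository functions (L2 distance)."""
    dists = np.linalg.norm(repo - query, axis=1)
    return np.argsort(dists)[:k]

def jaccard(a: set, b: set) -> float:
    """Jaccard index, used here for imported-function and string-constant sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

class Calibrator(nn.Module):
    """Feed-forward network that fuses the coarse similarity with fine-grained features."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.net(v)   # calibrated similarity s' in [0, 1]

def calibrated_similarity(coarse_sim, block_feats, q_strs, c_strs,
                          q_imports, c_imports, calibrator):
    # Concatenate the coarse similarity s with the fine-grained feature groups into V.
    v = torch.tensor([[coarse_sim, *block_feats,
                       jaccard(q_strs, c_strs),
                       jaccard(q_imports, c_imports)]], dtype=torch.float32)
    return calibrator(v).item()
```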

Experiments

Setup

[Table: experimental setup]

One-to-one Comparison

As shown, the AUC scores of VulHawk on both balanced and unbalanced datasets outperform SAFE, Asteria, GMN, PalmTree, Asm2Vec, and Trex in all experimental settings. For example, in the cross-architecture (XA) experiment, VulHawk achieves an AUC of 0.998, while Trex achieves 0.947, Asteria 0.951, SAFE only 0.509, and PalmTree and Asm2Vec fail in the cross-architecture experiment.

[Table III: One-to-one comparison results (AUC)]

One-to-many Search

We collect recall across different top-K results and plot recall versus K in Figure 9. The results show that VulHawk outperforms the current state-of-the-art methods, achieving the best recall@1 of 0.935 in the XO task and 0.879 in the XC+XO+XA task. In the XC+XO+XA task, when the number of retrieved results exceeds 30, the recall of each method tends to stabilize: VulHawk reaches a recall@30 of around 0.994, VulHawk-es around 0.968, VulHawk-s around 0.988, Trex around 0.888, and SAFE around 0.310.

[Figure 9: Recall@K for one-to-many search]

Many-to-many Matching

The baselines achieve their lowest results on the O0-O3 setting in the XO experiments. VulHawk achieves a recall of 0.876 in this experiment, which is 385.9%, 292.9%, 208.6%, 240.0%, 211.6%, 621.3%, and 202.8% higher than SAFE, Asteria, Asm2Vec, BinDiff, PalmTree, GMN, and Trex, respectively. Interestingly, VulHawk's worst result (0.805) occurs in the XC experiment, not in the O0-O3 experiment.

Figure 10 shows the distributions of recall and precision for the XC, XA, and XO tasks as violin plots, on which we annotate the mean result of each method. Compared with SAFE, BinDiff, Asteria, Asm2Vec, PalmTree, GMN, and Trex, the probability distributions of VulHawk's recall and precision are closer to 1 and more concentrated, while the result distributions of the other methods across different scenarios are more scattered and unstable. This shows that VulHawk performs better and more stably than the other baselines.

[Figure 10: Recall and precision distributions for XC, XA, and XO tasks (violin plots)]

Runtime Efficiency

Table V shows the time cost of searching functions in repositories of different sizes and their throughput. The results show that VulHawk is slower than VulHawk-es and VulHawk-s due to the use of entropy-based adapters during embedding generation and similarity calibration during search.

[Table V: Time cost and throughput for different repository sizes]

Ablation Study

Entropy-based Adapter. As shown in Table III, in the 7 one-to-one function comparison tasks, the AUC of VulHawk-s is higher than that of VulHawk-es.

Similarity Calibration. As can be seen from Table III and Figure 9, VulHawk achieves better results than VulHawk-s in the one-to-one function comparison and one-to-many search scenarios.

Training Tasks. To evaluate the contribution of the three training tasks, VulHawk is also evaluated with different training settings. To measure the contribution of each training task more clearly, the evaluated models do not use the entropy-based adapter or similarity calibration and differ only in the training tasks. In the XC+XO+XA task of the one-to-one comparison scenario, the model trained with the MLM task alone achieves an AUC of 0.833, the model trained with MLM+ROP achieves 0.934, and the model trained with MLM+ROP+ABP achieves 0.966.

File Environment Identification

We also evaluate the accuracy of entropy-based file environment identification. Here, we use 10-fold cross-validation to split all binary data for training and evaluation, as in a traditional machine learning setup. These binaries come from different architectures (x86, ARM, and MIPS) and different compilers (GCC and Clang). Pizzolotto et al. [44] used a CNN model and an LSTM model on function bytes to identify file environments. To better demonstrate performance, their pre-trained models were downloaded and both (CNN and LSTM) were set up for comparison. Note that in practice the compiler and optimization level of a given binary are unknown, so for practicality, while evaluating one parameter (e.g., compiler), we do not fix the others (e.g., architecture and optimization level).


1-day Vulnerability Detection from Firmware

In this experiment, we collect 20 of the latest IoT firmware images from three vendors (D-Link, TP-Link, and NetGear) and run VulHawk and the other baselines on a 1-day vulnerability detection task. The OpenSSL and Curl projects, which are widely used in IoT firmware, are selected as targets, and the vulnerability library is built from the Common Vulnerabilities and Exposures (CVE) database. The library contains 12 related CVE vulnerability functions and their patched functions. The detailed information and ground truth are shown in Table VI. There are 53,739 functions in total, of which 93 are related vulnerable functions and 119 are related patched functions. For each vulnerable/patched function, its function embedding is generated using VulHawk and its fine-grained features are recorded for similarity calibration.

[Table VI: 1-day vulnerability detection details and results]

Table VI presents the results of VulHawk and the other baselines under the best threshold according to Fig. 13. Across the 12 CVEs, Trex achieves 0 false positives but a recall of only 52.7%; GMN achieves a recall of 86.0% but generates 36,342 false positives; Asteria achieves a precision of 64.1% with zero false negatives; and VulHawk achieves the best performance with zero false positives and 100% recall.

Summary

References

[5] Y. David, N. Partush, and E. Yahav, “Statistical similarity of binaries,” in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, pp. 266–280.
[6] Y. David, N. Partush, and E. Yahav, “Firmup: Precise static detection of common vulnerabilities in firmware,” in ACM SIGPLAN Notices, vol. 53, no. 2. ACM New York, NY, USA, 2018, pp. 392–404.
[7] Y. David and E. Yahav, “Tracelet-based code search in executables,” Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 349–360, 2014.
[9] S. H. Ding, B. C. Fung, and P. Charland, “Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16. New York, New York, USA: ACM Press, 2016, pp. 461–470.
[10] S. H. Ding, B. C. Fung, and P. Charland, “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 472–489.
[11] Y. Duan, X. Li, J. Wang, and H. Yin, “DeepBinDiff: Learning ProgramWide Code Representations for Binary Diffing,” in Proceedings of the 27rd Symposium on Network and Distributed System Security (NDSS), 2020.
[16] J. Gao, X. Yang, Y. Fu, Y. Jiang, and J. Sun, “Vulseeker: A semantic learning based vulnerability seeker for cross-platform binary,” ASE 2018 - Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 896–899, 2018.
[27] X. Li, Q. Yu, and H. Yin, “Palmtree: Learning an assembly language model for instruction embedding,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021, pp. 3236–3251.
[29] J. Lin, D. Wang, R. Chang, L. Wu, Y. Zhou, and K. Ren, “Enbindiff: Identifying data-only patches for binaries,” IEEE Transactions on Dependable and Secure Computing, no. 01, pp. 1–1, 2021.
[31] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[33] R. Lyda and J. Hamrock, “Using entropy analysis to find encrypted and packed malware,” IEEE Security & Privacy, vol. 5, no. 2, pp. 40–45, 2007.
[34] A. Marcelli, M. Graziano, X. Ugarte-Pedrero, Y. Fratantonio, M. Mansouri, and D. Balzarotti, “How machine learning is solving the binary function similarity problem.”
[35] L. Massarelli, G. A. Di Luna, F. Petroni, R. Baldoni, and L. Querzoni, “Safe: Self-attentive function embeddings for binary similarity,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2019, pp. 309–329.
[41] K. Pei, Z. Xuan, J. Yang, S. Jana, and B. Ray, “Trex: Learning execution semantics from micro-traces for binary similarity,” arXiv preprint arXiv:2012.08680, 2020.
[43] J. Pewny, F. Schuster, L. Bernhard, T. Holz, and C. Rossow, “Leveraging semantic signatures for bug search in binary programs,” in Proceedings of the 30th Annual Computer Security Applications Conference, 2014, pp. 406–415.
[47] C. E. Shannon, “A mathematical theory of communication,” The Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
[48] P. Shirani, L. Collard, B. L. Agba, B. Lebel, M. Debbabi, L. Wang, and A. Hanna, “Binarm: Scalable and efficient detection of vulnerabilities in firmware images of intelligent electronic devices,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2018, pp. 114–138.
[49] Synopsys, “Open source security and risk analysis report,” https://www.synopsys.com/content/dam/synopsys/sig-assets/reports/rep-ossra-2022.pdf, 2022.
[55] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS ’17. Dallas,TX, USA: ACM Press, 2017, pp. 363–376.
[57] J. Yang, C. Fu, X.-Y. Liu, H. Yin, and P. Zhou, “Codee: A tensor embedding scheme for binary code search,” IEEE Transactions on Software Engineering, 2021.
[58] S. Yang, L. Cheng, Y. Zeng, Z. Lang, H. Zhu, and Z. Shi, “Asteria: Deep learning-based ast-encoding for cross-platform binary code similarity detection,” in 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2021, pp. 224–236
[60] F. Zuo, X. Li, P. Young, L. Luo, Q. Zeng, and Z. Zhang, “Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs,” in Proceedings 2019 Network and Distributed System Security Symposium. Reston, VA: Internet Society, 2019.

Insights

(1) Multiple features are directly concatenated into one vector
(2) Entropy streams distinguish compilers and optimization options


Origin blog.csdn.net/qq_33976344/article/details/127060647