[Paper Sharing] How Machine Learning Is Solving the Binary Function Similarity Problem

How Machine Learning Is Solving the Binary Function Similarity Problem [USENIX 2022]

Andrea Marcelli Cisco Systems, Inc.
Mariano Graziano Cisco Systems, Inc.
Xabier Ugarte-Pedrero Cisco Systems, Inc.
Yanick Fratantonio Cisco Systems, Inc.
Mohamad Mansouri EURECOM
Davide Balzarotti EURECOM

The ability to accurately compute the similarity between two pieces of binary code plays an important role in many different problems. Several research communities, such as security, programming languages, and machine learning, have been working on this topic for more than five years, publishing hundreds of papers on the subject. One would expect that, by now, it would be possible to answer research questions that go beyond the very specific techniques presented in the papers and that generalize to the entire field. Unfortunately, this topic is affected by a number of challenges, ranging from reproducibility issues to the opacity of research results, which hinder meaningful and effective progress.
In this paper, we set out to conduct the first measurement study of the current state of this research field. We first systematize the existing body of research. We then identify a number of relevant approaches, representative of the wide range of solutions recently proposed by three different research communities. We reimplemented these approaches and created a new dataset (binaries built with different compilers, optimization settings, and for three different architectures), which allowed us to perform fair and meaningful comparisons. This effort allowed us to answer a number of research questions that go beyond what could be inferred by reading the individual research papers. By releasing our entire modular framework and our dataset (along with the associated documentation), we also hope to inspire future work in this interesting research area.

Bottom Line: Systematic Evaluation of Binary Similarity Methods

Introduction

Challenges

The first challenge is that it is currently neither possible to reproduce nor to replicate previous results. Sadly, this is a common problem in the security world, and binary similarity is a particularly good example of it. Of the 61 solutions reported in the survey by Haq et al. [27], only 12 released their tools to other researchers. Even when artifacts are available, they are often incorrect (e.g., they do not implement the exact solution described in the paper), incomplete (e.g., they lack important components such as those used for feature extraction), or the code simply does not work on a dataset different from the one its authors used. Since re-implementing previous techniques is complex and time-consuming, each solution is usually compared with only a few previous ones, which are sometimes not even designed to solve the same problem, and in some corner cases it is compared only to previous papers by the same authors.

A second challenge is that evaluation results are often opaque. Different solutions often aim at slightly different goals (e.g., searching for vulnerabilities vs. finding similar malware samples), in different settings (e.g., cross-compiler vs. cross-architecture), using different notions of similarity (same code vs. same semantics), and operating at different granularities (e.g., code fragments vs. whole functions). Experiments are also performed on datasets of different sizes and natures (e.g., firmware vs. command-line utilities), and results are reported using different metrics (e.g., ROC curves vs. top-n vs. MRR10).

The combined impact of the first two challenges has resulted in an extremely fragmented field, with dozens of techniques and no clear understanding of which ones work (or do not work) in which contexts. This brings us to a final challenge: it is difficult to understand the direction of binary similarity research. Each new solution employs a more complex technique, or a new combination of techniques, and it is difficult to tell whether this is driven by practical limitations of simpler approaches or by the need to convince reviewers of the novelty of each work.

Contributions

In this paper, we perform the first systematic measurement in this research area. We first explore existing research and group each solution according to the approach it takes, paying particular attention to recent successful techniques based on machine learning. We then select, reimplement, and compare the ten most representative approaches and their possible variants. These approaches capture broad trends across three distinct research communities: computer security, programming languages, and machine learning.

By reimplementing various approaches (and not necessarily "papers"), we isolate existing "primitives" and evaluate them when used alone or in combination with each other, in order to gain insights, identify the important factors hidden in the complexity of prior works, and answer a number of open research questions. To make the evaluation efforts more comparable, we also propose a new dataset as a common benchmark covering different aspects such as compiler families, optimizations, and architectures.

Our evaluation highlights several interesting insights. For example, we found that while simple methods (e.g., fuzzy hashing) work well in simple settings, they fail when dealing with more complex scenarios (e.g., cross-architecture datasets, or datasets in which multiple variables change simultaneously). Among the machine learning models, those based on graph neural networks achieve the best results in almost all tasks and are also the fastest in terms of inference time. Another interesting finding is that many recently published papers reach very similar accuracy when tested on the same dataset, even though several of them claim state-of-the-art improvements.

While we do not claim that our code or datasets are better or more representative than previous works, we release our modular framework reimplementing all selected methods, the full dataset, and detailed instructions on how to recreate and tune it. By allowing the community to experiment with individual components and compare them directly with each other, we hope to encourage and ease the efforts of future researchers interested in approaching this active research area.

Method

The Binary Function Similarity Problem

In its simplest form, binary function similarity aims to compute a numerical value that captures the "similarity" of a pair of functions in their binary representation, i.e., the raw bytes (machine code) that make up the function body. Note that in this paper we focus on methods that use the function as the unit of code, although researchers have also studied techniques that operate on lower-level abstractions (e.g., basic blocks) or higher-level abstractions (e.g., entire programs).
Binary function similarity has been studied in more than a hundred papers. To further complicate the picture, most existing approaches cannot be mapped to a single category of techniques, because they are usually built by combining different components.

Measuring Function Similarity

Direct vs. indirect comparisons. Techniques for measuring function similarity can be divided into two broad categories. The first class of solutions enables the direct comparison of pairs of functions, either by considering the raw input data or by implementing some kind of feature extraction. These solutions often need to learn that two seemingly unrelated values can represent similar functions and, vice versa, that close values do not necessarily represent similar things. This situation arises when the features extracted from binary functions cannot be directly compared by using an underlying similarity measure, since they may not lie in a linear space or may not be equally weighted in the similarity score. Therefore, researchers have proposed using a machine learning model that, given a set of extracted features as input, determines whether two functions are similar. Several approaches implement this type of similarity by using Bayesian networks [2], convolutional neural networks [44], graph matching networks (GMN) [40], regular feed-forward neural networks [67], or combinations of them [37]. In these cases, the model outputs a similarity score for a pair of functions.
To find similar functions, these methods need to search the entire dataset and compare the features of the query function with every entry, which is not a scalable solution. For this reason, many approaches pre-filter potentially similar candidates through indexing strategies such as tree-based data structures, locality-sensitive hashing (approximate nearest neighbor search), Bloom filters, custom pre-filters based on simpler features, clustering techniques, or even distributed search methods such as map-reduce [15].
The second class of solutions implements indirect comparison techniques. These methods map the input features to "condensed" low-dimensional representations that can easily be compared with one another using distance measures such as the Euclidean or cosine distance. These solutions allow efficient one-to-many comparisons: for example, if a new function needs to be compared against an entire dataset, one can first map each function in the repository to its low-dimensional representation (a one-time operation), then do the same for the new function, and finally compare the representations using efficient techniques such as approximate nearest neighbors.
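To make this one-to-many workflow concrete, the following minimal sketch (our own illustration, not code from the paper's framework) assumes the function embeddings are already available as NumPy vectors and ranks a repository by cosine similarity to a query:

```python
import numpy as np

def cosine_rank(query_emb: np.ndarray, repo_embs: np.ndarray, top_k: int = 10):
    """Rank repository functions by cosine similarity to a query embedding.

    query_emb: shape (d,)   -- embedding of the query function (assumed given)
    repo_embs: shape (n, d) -- embeddings of the n repository functions
    Returns the indices of the top_k most similar repository functions.
    """
    # Normalizing the repository once corresponds to the "one-time operation"
    # mentioned above; only the query needs to be processed at search time.
    repo_norm = repo_embs / np.linalg.norm(repo_embs, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    sims = repo_norm @ query_norm          # cosine similarities, shape (n,)
    return np.argsort(-sims)[:top_k]       # indices of the best matches

# Hypothetical usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
repo = rng.normal(size=(1000, 128))
query = rng.normal(size=128)
print(cosine_rank(query, repo, top_k=5))
```

In a large-scale deployment, the exhaustive dot product would typically be replaced by an approximate nearest neighbor index, as discussed above.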

Fuzzy hashing and embeddings. A popular example of a low-dimensional representation is fuzzy hashing. Fuzzy hashes are produced by algorithms that, unlike traditional cryptographic hashes, are intentionally designed to map similar input values to similar hashes. The drawback is that small changes in the raw input bytes can still considerably affect the resulting hash. Thus, even though generic fuzzy hashing may not be well suited to function similarity, some methods (such as FunctionSimSearch [18]) have proposed more specialized hashing techniques for comparing two functions.
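To illustrate the principle (a generic MinHash-style sketch over byte n-grams, not Catalog1's or FSS's actual algorithm), similar byte sequences produce signatures that agree in many positions, and the fraction of matching positions approximates their set similarity:

```python
import hashlib

def minhash_signature(raw_bytes: bytes, num_hashes: int = 16, ngram: int = 4):
    """Generic MinHash-style signature over byte n-grams (illustrative only)."""
    grams = {raw_bytes[i:i + ngram] for i in range(len(raw_bytes) - ngram + 1)}
    signature = []
    for seed in range(num_hashes):
        # One "hash function" per seed; keep the minimum hash over all n-grams.
        signature.append(min(
            int.from_bytes(hashlib.sha256(bytes([seed]) + g).digest()[:8], "big")
            for g in grams
        ))
    return signature

def signature_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two hypothetical function bodies differing by a single byte.
f1 = b"\x55\x48\x89\xe5\x48\x83\xec\x10\x89\x7d\xfc\x8b\x45\xfc\x5d\xc3"
f2 = b"\x55\x48\x89\xe5\x48\x83\xec\x20\x89\x7d\xfc\x8b\x45\xfc\x5d\xc3"
print(signature_similarity(minhash_signature(f1), minhash_signature(f2)))
```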
Another popular low-dimensional representation relies on embeddings. The term is popular in the machine learning community and refers to a low-dimensional space in which semantically similar inputs are mapped to points that are close to each other, regardless of how different the inputs may appear in their original representation. The goal of a machine learning model is then to learn how to generate embeddings that maximize the similarity between similar functions and minimize it between different ones. In the literature, we can identify two main types of embeddings: those that try to summarize the code of each function, and those that try to summarize its graph structure.

Code embeddings. Many researchers have attempted to leverage existing natural language processing (NLP) techniques to address the binary function similarity problem by treating assembly code as text. These solutions process streams of tokens (e.g., instructions, mnemonics, operands, normalized instructions) and output one embedding per code block, one embedding per instruction, or both.
The first class of methods (such as Asm2Vec [14] and [64]) is based on word2vec [52, 53], a well-known technique from the NLP field. Although these models are not designed for cross-architecture embedding generation, they can be trained on different instruction sets simultaneously, learning the syntax of the different languages (but without mapping semantics across them), or they can be applied on top of an intermediate language.
The second class of solutions is based on seq2seq encoder-decoder models [69], which allow the semantics of different architectures to be mapped to the same embedding space, thus learning similarity across architectures.
The third type of model builds on BERT [12], a state-of-the-art pre-trained NLP model based on the Transformer architecture [71]. For example, OrderMatters [78] uses a BERT model pre-trained on four tasks to generate basic-block embeddings, while Trex [60] uses a hierarchical transformer and a masked language modeling task to learn approximate program execution semantics, and then transfers the learned knowledge to identify semantically similar functions.
Assembly code embeddings are often limited by the number of different instructions they can handle (the so-called out-of-vocabulary (OOV) problem) and by the maximum number of instructions that can be provided as model input. For this reason, different methods compute instruction-level, basic-block-level, or function-level embeddings. Instruction or basic-block embeddings are sometimes combined with other algorithms (such as the longest common subsequence) to compute function similarity, or they are used as part of more complex models.
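As a minimal sketch of the word2vec-style instruction embedding idea (inspired by, but not reproducing, the models in [45, 49]; the token lists below are toy placeholders for real normalized instructions):

```python
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is one function: a list of normalized instruction tokens.
functions_as_tokens = [
    ["push_rbp", "mov_rbp_rsp", "mov_reg_imm", "add_reg_reg", "pop_rbp", "ret"],
    ["push_rbp", "mov_rbp_rsp", "sub_rsp_imm", "call_addr", "leave", "ret"],
    ["mov_reg_imm", "xor_reg_reg", "cmp_reg_imm", "jne_addr", "ret"],
]

# Skip-gram model; Gensim >= 4.0 API (Gensim 3.x uses size= instead of vector_size=).
model = Word2Vec(sentences=functions_as_tokens, vector_size=64, window=3,
                 min_count=1, sg=1, epochs=50, seed=0)

def function_embedding(tokens):
    """A simple (and lossy) function-level embedding: the mean of its
    instruction embeddings, skipping out-of-vocabulary tokens."""
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

print(function_embedding(functions_as_tokens[0]).shape)  # (64,)
```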

Graph embeddings. Another line of research builds on machine learning methods that compute graph embeddings. These are well suited to capture the properties of a function's control flow graph, which is cross-architecture by nature. Such embeddings can be generated by custom algorithms or by more sophisticated machine learning techniques, such as graph neural networks (GNNs). Some recent approaches from the machine learning community have proposed variants of GNNs, such as GMN. These variants produce embeddings that are comparable in the vector space, with the specificity that each embedding encodes information from both graphs given as input to the model.
Graph embedding methods also often encode information about each basic block into the corresponding node of the graph to increase expressiveness. For example, some solutions compute a set of attributes for each node, resulting in attributed control flow graphs (ACFGs); the attributes can be manually engineered [24, 76] or learned automatically in an unsupervised way [45]. Other authors add further embedding-computation layers (e.g., at the basic-block level [45, 78, 79]) using some of the techniques discussed earlier.
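A minimal sketch of how such an attributed CFG can be built with NetworkX, using a bag-of-words of opcodes as the node feature (the basic blocks, edges, and opcode vocabulary are toy assumptions, not the feature sets evaluated in the paper):

```python
import networkx as nx
import numpy as np

OPCODE_VOCAB = ["mov", "add", "sub", "cmp", "jmp", "jne", "call", "ret"]  # toy vocabulary

def bag_of_words(opcodes):
    """Count how many times each vocabulary opcode appears in a basic block."""
    vec = np.zeros(len(OPCODE_VOCAB), dtype=np.float32)
    for op in opcodes:
        if op in OPCODE_VOCAB:
            vec[OPCODE_VOCAB.index(op)] += 1
    return vec

# Hypothetical basic blocks of one function: block id -> list of opcodes.
blocks = {
    0: ["mov", "cmp", "jne"],
    1: ["add", "mov", "jmp"],
    2: ["sub", "mov", "jmp"],
    3: ["call", "ret"],
}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

acfg = nx.DiGraph()
for block_id, opcodes in blocks.items():
    acfg.add_node(block_id, features=bag_of_words(opcodes))
acfg.add_edges_from(edges)

# The adjacency matrix and the node feature matrix are the typical GNN inputs.
adj = nx.to_numpy_array(acfg, nodelist=sorted(acfg.nodes))
feats = np.stack([acfg.nodes[n]["features"] for n in sorted(acfg.nodes)])
print(adj.shape, feats.shape)  # (4, 4) (4, 8)
```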

Function Representations

Binary functions are essentially streams of bytes corresponding to architecture-specific machine code and data. Starting from this raw input, researchers have used a number of methods to extract higher-level information that can be used to tell whether two functions come from the same source code. The list, sorted by increasing level of abstraction, includes the following categories.
(1) Raw bytes: some solutions directly use the raw binary content as the starting point for the similarity measure.
(2) Assembly: the assembly instructions obtained from a disassembler; they are useful when one wants to encode the fact that the same operation can be performed in many different ways, depending on the instruction size or the operands.
(3) Normalized assembly: assembly code usually encodes constants and other values that may change across compilations, so several methods normalize the instructions (see the sketch after this list).
(4) Intermediate representations: some methods work at a higher level of abstraction by lifting the binary code to an intermediate representation (IR).
(5) Structure: many methods attempt to capture the internal structure of a given function, or the role the function plays in the overall program.
(6) Data flow analysis: at the assembly level, the same arithmetic expression can be implemented in different forms that achieve the same semantics, and data flow analysis can capture this.
(7) Dynamic analysis: some methods rely on dynamic analysis.
(8) Symbolic execution and analysis: as opposed to concrete dynamic execution, some methods rely on symbolic execution to fully capture the behavior of the analyzed function and to determine the relations between its inputs and outputs.
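As referenced in item (3), here is a deliberately simplistic sketch of assembly normalization (illustrative only; real systems apply much richer, architecture-aware rules):

```python
import re

def normalize_instruction(ins: str) -> str:
    """Replace immediates and addresses with placeholder tokens so that values
    which change across compilations do not dominate the comparison."""
    ins = re.sub(r"0x[0-9a-fA-F]+", "ADDR", ins)  # hex constants / addresses
    ins = re.sub(r"\b\d+\b", "IMM", ins)          # decimal immediates
    return ins

print(normalize_instruction("mov eax, 0x4005d0"))  # -> mov eax, ADDR
print(normalize_instruction("add rsp, 16"))        # -> add rsp, IMM
```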

Selected Approaches

One of the main contributions of our work is to provide reference implementations for a number of key methods and to compare them through experiments on public and synthetic datasets. Ideally, one would evaluate as many methods as possible but, obviously, reimplementing all of them is not feasible. It is also important to understand that, while there are hundreds of papers on the topic, many of them are small variations of the same technique, and the number of truly novel solutions is significantly smaller.
Scalability and real-world applicability. We are interested in methods that have the potential to scale to large datasets and that can be applied to real-world use cases. Therefore, we do not evaluate methods that are inherently slow and limited to direct comparisons, such as those based on dynamic analysis, symbolic execution, or graph-related algorithms of high complexity.
Focus on representative methods rather than specific papers. Many research works propose small variations of the same approach, for example by reusing previous techniques while slightly changing the features used. This often results in similar overall accuracy, which makes them less interesting for our comparison.
Cover different communities. Research contributions to the binary function similarity problem come from different research communities, both in academia and in industry.
Prioritize the latest trends. While the first contributions to this area of research date back more than a decade, interest has exploded in recent years.

Of the many papers published in the past decade, only a fraction meets the above criteria. Based on our analysis, we identified 30 relevant approaches, shown in Figure 1, from which we selected 10 representative solutions for our study.

The diagram on the left side of Figure 1 clusters the methods according to their respective research groups. These groups come from both academia and industry: Google and Tencent, for instance, are very active in this area. The edges represent the other solutions against which each paper compares its results. For example, the arrow between Gemini and Genius indicates that the authors compared Gemini's results with those previously obtained by Genius (both from the same group). The right part of Figure 1 shows the publication timeline on the Y-axis and the different types of input data on the X-axis. The methods are then clustered into three major categories, namely fuzzy hashing, graph embedding, and code embedding, according to how they compute similarity.
Both diagrams use labels (in parentheses) to identify communities ([S] Security, [PL] Programming Languages, [ML] Machine Learning, and [SE] Software Engineering). We also use [Mono] and [Cross] tags to denote whether the proposed method focuses on single-architecture or cross-architecture scenarios, respectively.
Even though the graph in Figure 1 is not comprehensive and only shows the papers we selected, it again illustrates how several papers compare against only a limited set of previous methods. We can also extract other interesting information from these plots. First, the binary diffing tools grouped in the middle box are all designed to directly compare two binaries (e.g., they use call graphs), and they are all single-architecture. Second, the graph shows that the different communities tend to be rather closed and rarely compare with papers from other fields. This is an obvious limitation for advancing research on function similarity, and we hope this paper will foster collaboration between the different fields. Finally, we can identify seminal papers, such as Gemini [76] and discovRE [20], that have been re-implemented and extensively used for comparison in other studies. Such works have clearly inspired other researchers to advance the state of the art.
The timeline graph on the right shows a clear trend: the complexity of the solutions and the use of machine learning grow over time. We use this information and the relationships depicted in the figure to select 10 solutions that are scalable, representative, and in line with the latest trends. At the same time, we tried to maximize the variety across research communities.


Experiments

Implementation

One of the goals of this study is to make a fair comparison between different methods. For this reason, we implemented each stage of evaluation in a unified manner, including binary analysis, feature extraction, and machine learning implementation. In this way, it is possible to establish a common basis for meaningful and fair comparisons of different approaches.
For the binary analysis stage we use IDA Pro 7.3, while for feature extraction we rely on a set of Python scripts based on the IDA Pro API, Capstone, and NetworkX. We implemented all the neural network models in TensorFlow 1.14, with the only exception of Trex [60], which is built on top of Fairseq [57], a sequence modeling toolkit for PyTorch. Finally, we implement Asm2Vec [14] and the instruction embedding models used in [45, 49] with Gensim 3.8 [65].
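For illustration, the following minimal sketch (not part of the released framework) shows how raw function bytes can be disassembled with Capstone to obtain the mnemonic stream that several feature extractors consume; the byte string is a small hand-written x86-64 function used purely as an example:

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)

# Hypothetical raw bytes of a tiny x86-64 function (push rbp; ...; ret).
code = b"\x55\x48\x89\xe5\x89\x7d\xfc\x8b\x45\xfc\x83\xc0\x01\x5d\xc3"
base_addr = 0x400000

mnemonics = []
for ins in md.disasm(code, base_addr):
    mnemonics.append(ins.mnemonic)
    print(f"0x{ins.address:x}\t{ins.mnemonic}\t{ins.op_str}")

print(mnemonics)  # ['push', 'mov', 'mov', 'mov', 'add', 'pop', 'ret']
```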
Along the way, we adopted a unified implementation to minimize computational differences and we introduced several code optimizations. When the code was not available, we contacted the authors, but we either received no reply or obtained only limited support. Both Zeek [67] and Asm2Vec [14] were completely reimplemented, while CodeCMR was tested directly by its authors, due to the high complexity of the model and to several "hidden" variables not discussed in the original paper.
Additional technical details of all our implementations, as well as information about our efforts to contact the respective authors and considerations regarding the use of pretrained models, can be found in [47].


Datasets

We created two new datasets, Dataset-1 and Dataset-2, designed to capture the complexity and variability of real-world software while covering different challenges of binary function similarity: (i) multiple compiler families and versions, (ii) multiple compiler optimizations, (iii) multiple architectures and bitness, and (iv) software of different nature (command-line utilities vs. GUI applications). We use Dataset-1 to train the machine learning models and both datasets to test all the evaluated methods.


Following our definition of function similarity, we disable function inlining in order to compare functions coming from exactly the same source code: inlining effectively adds code that is not part of the original source function, which would pollute our results and lead to misleading conclusions.


For the benefit of the community, and to facilitate future work in this area, we release the full dataset to the public, available at [47]. We also release the scripts and patches used to compile it, so that future researchers can recreate the datasets and build on our work.

Fuzzy-hashing Comparison

Catalog1 uses raw bytes as input features and different signature sizes (i.e., the number of hash functions): we show results for two variants, one with size 16 and the other with size 128. In contrast, FunctionSimSearch (FSS) uses a combination of graphlets (G), mnemonics (M), and immediates (I): we incrementally enable the different types of input features across the tests, including their weighted linear combination (w).
Since fuzzy hashing methods do not require a training phase, we use them for a targeted evaluation of how each compilation variable affects the comparison of binary functions. For these methods, we therefore first perform multiple experiments in which we vary one variable at a time (i.e., compiler family, compiler version, optimizations, architecture, and bitness) while keeping the rest constant. The results in Table 1 clearly show that even something as simple as fuzzy hashing works when only one free variable is considered at a time: "raw" bytes prove to be a good feature for same-architecture comparisons, while graphlets are effective in cross-architecture comparisons. For Catalog1, larger signature sizes provide better performance, but they are limited by the total number of hash functions included in the implementation.
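The "one free variable at a time" setup can be illustrated with a small sketch (using a hypothetical metadata schema, not the released dataset format): a pair qualifies for, say, the compiler-only experiment if the two compilations of the same source function differ in compiler family and agree on everything else.

```python
from itertools import combinations

# Hypothetical records: one entry per compiled variant of a source function.
records = [
    {"func": "ssl_read", "compiler": "gcc",   "version": "9", "opt": "O2", "arch": "x86", "bits": 64},
    {"func": "ssl_read", "compiler": "clang", "version": "9", "opt": "O2", "arch": "x86", "bits": 64},
    {"func": "ssl_read", "compiler": "gcc",   "version": "9", "opt": "O3", "arch": "arm", "bits": 32},
]

def pairs_varying_only(records, free_var):
    """Yield pairs of variants of the same function that differ only in free_var."""
    fixed = [k for k in ("compiler", "version", "opt", "arch", "bits") if k != free_var]
    for a, b in combinations(records, 2):
        if (a["func"] == b["func"]
                and a[free_var] != b[free_var]
                and all(a[k] == b[k] for k in fixed)):
            yield a, b

for a, b in pairs_varying_only(records, "compiler"):
    print(a["compiler"], "vs", b["compiler"])  # gcc vs clang
```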


We then evaluate the two methods on the six tasks introduced earlier. Tables 2 and 4 show the results on Dataset-1 and Dataset-2: with multiple free variables at play at the same time, the problem becomes much harder and simple methods are no longer effective. In the XC task (Table 2), Catalog1 and FSS have the same AUC. For FSS, the graphlet-only (G) configuration is the best for all tasks except XC and XO, where graphlets with mnemonics (G+M) reach a higher AUC.


Machine-learning Models Comparison

We evaluate all the selected methods using training data extracted from Dataset-1 (with the exception of Trex [60]), and we create positive and negative pairs using criteria similar to those of the XM task. It is important to note that the results for each task could be further improved by using task-specific training data. We did perform this evaluation, but we omit the results because we noticed that training on the most general data (XM) yields close to the best overall performance on each task.
Comparing machine learning models, especially deep neural networks, is a challenging task, because several variables can affect the final results: the model implementation and configuration (e.g., the number of layers or the type of recurrent neural network), the hyperparameters (e.g., learning rate and batch size), the loss function, the optimizer, and the number of training epochs. To be as consistent as possible in our comparison, all models were trained using the same randomly generated data, drawn from 256,625 unique binary functions. Furthermore, we conducted extensive experiments to evaluate different feature sets, model configurations, hyperparameters, and loss functions. The results of each model could be further improved with an extensive grid search, and our results can be used as a starting point for future work.
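To illustrate the pair-based training setup (a generic sketch, not the loss of any specific evaluated model), positive pairs are two compilations of the same source function, negative pairs come from different functions, and a margin-based loss pushes their similarities apart:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_loss(emb_a, emb_b, is_positive, margin=0.5):
    """Simple contrastive-style loss on cosine similarity (illustrative only):
    positive pairs are pushed toward similarity 1, while negative pairs are
    penalized only if their similarity exceeds the margin."""
    sim = cosine(emb_a, emb_b)
    if is_positive:
        return 1.0 - sim
    return max(0.0, sim - margin)

rng = np.random.default_rng(0)
anchor = rng.normal(size=64)
positive = anchor + 0.1 * rng.normal(size=64)  # same source function, different compilation
negative = rng.normal(size=64)                 # unrelated function

print(pair_loss(anchor, positive, True))   # small loss
print(pair_loss(anchor, negative, False))  # zero unless similarity > margin
```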
Tables 3 and 4 show the results of testing the models and their respective variants on the two datasets. Table 8 contains some general information about the model and its training, such as the number of parameters, batch size, number of epochs, and training time per epoch.
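For reference, a small sketch of how the ranking metrics reported in the tables can be computed from ranked candidate lists (our own illustrative implementation, not the paper's evaluation code):

```python
def mrr_at_k(ranked_lists, ground_truth, k=10):
    """Mean reciprocal rank, considering only the top-k candidates per query."""
    total = 0.0
    for query, ranked in ranked_lists.items():
        top = ranked[:k]
        if ground_truth[query] in top:
            total += 1.0 / (top.index(ground_truth[query]) + 1)
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, ground_truth, k=1):
    """Fraction of queries whose true match appears among the top-k candidates."""
    hits = sum(ground_truth[q] in ranked[:k] for q, ranked in ranked_lists.items())
    return hits / len(ranked_lists)

# Hypothetical results: query function -> candidates sorted by similarity.
ranked = {"f1": ["f1_clang", "g3", "h7"], "f2": ["g1", "f2_clang", "h2"]}
truth = {"f1": "f1_clang", "f2": "f2_clang"}
print(mrr_at_k(ranked, truth, k=10))    # (1/1 + 1/2) / 2 = 0.75
print(recall_at_k(ranked, truth, k=1))  # 0.5
```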


The results show that, among the models producing a vector representation of the functions (i.e., embeddings), the GNN from [40] achieves the best values across all metrics and all tasks. We also note that most machine learning models perform very similarly in terms of AUC but differ on the ranking metrics (MRR10 and recall@1), as shown in Figure 2. Among the other embedding models, SAFE [49] provides a better AUC than the GNN with unsupervised features [45], and a slightly better one than Gemini [76] in one specific configuration. Among the methods performing direct comparisons, the GMN from [40] is the best-performing model across all tasks, while Zeek [67] has a slightly lower AUC (except for large functions) but a much lower MRR10 and recall@1.


As shown in Table 5, the GNN model using a bag-of-words (BoW) of IDA microcode instructions has a higher AUC than the GNN model using a BoW of opcodes, but the second model has a higher recall for large values of K (Figure 3). In general, all the metrics of the BinaryAI/CodeCMR model are higher than those of the other models we tested. If these results are validated by independent studies in the community, this could be a very promising research direction.

Vulnerability Discovery Use Case

As an example of a security application, we test all the models on a vulnerability discovery task. To this end, we selected ten vulnerable functions from OpenSSL 1.0.2d, covering a total of eight CVEs. As targets, we chose the libcrypto libraries embedded in two firmware images: a Netgear R7000 (ARM 32-bit) and a TP-Link Deco M4 (MIPS 32-bit). The details of the vulnerabilities affecting each firmware image are included in [47]. We compiled the ten vulnerable functions for four architectures (x86, x64, ARM 32-bit, MIPS 32-bit) and performed a ranking evaluation similar to the one presented in the previous tests. When evaluating against a specific firmware image, we only use as queries the vulnerable functions that affect that image.
The results are shown in Table 7: we use MRR10 as the comparison metric to evaluate how each model ranks the vulnerable target function for each query function.

However, the FSS model with custom weights has the highest MRR10 in the x64 comparison against the Netgear R7000. We used the weights shipped with the code, which had been optimized for comparisons on OpenSSL. This proves that the optimization process implemented by FSS has practical use cases, but it does not generalize to other configurations. Table 7 also shows comparisons across different architectures; in particular, the ARM32 column for the Netgear image and the MIPS32 column for the TP-Link image correspond to same-architecture comparisons. The Netgear R7000 firmware is compiled for ARM 32-bit, while the TP-Link Deco M4 is compiled for MIPS 32-bit: this explains why Asm2Vec has a high MRR10 value in the corresponding column. Finally, Table 6 contains the actual ranking results of the vulnerable functions for the Netgear R7000 image, showing that in practice a higher MRR10 value may hide lower rankings.


Summary

Discussion

What is the main contribution of the new machine learning solutions compared to simpler fuzzy hashing methods? Deep learning models provide an efficient way to learn function representations (i.e., embeddings) that enforce spatial separation between different types of functions. Unlike fuzzy hashing methods, machine learning models achieve high accuracy even when multiple compilation variables change simultaneously, and they benefit from large training datasets built on the solid ground truth defined by the compilation options.

What is the role of the different feature sets? It turns out that the choice of the machine learning model, in particular the use of a GNN and the choice of the loss function, is as important as the input features. Using basic-block features (e.g., ACFG) gives better results, but there is little difference between carefully handcrafted features and simpler ones (e.g., a bag-of-words of basic-block opcodes).

Do different methods perform better on different tasks? In particular, are cross-architecture comparisons more difficult than single-architecture ones? Our evaluation shows that most machine learning models perform remarkably similarly on all evaluation tasks, regardless of whether the comparison is within the same architecture or across architectures.

Are there specific research directions that look more promising for the future design of new techniques? Deep learning models turn out to meet the scalability and accuracy requirements of the different function similarity tasks, in particular thanks to their ability to learn function representations suitable for multiple tasks. While the GNN models provided the best results, there are still dozens of different variants to be tested.

References

[2] Saed Alrabaee, Paria Shirani, Lingyu Wang, and Mourad Debbabi. Fossil: a resilient and efficient system for identifying foss functions in malware binaries. ACM Transactions on Privacy and Security (TOPS),21(2):1–34, 2018.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
[14] Steven H. H. Ding, Benjamin C. M. Fung, and Philippe Charland. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 472–489, San Francisco, CA, USA, May 2019. IEEE.
[18] Thomas Dullien. Searching statically-linked vulnerable library functions in executable code. https://googleprojectzero.blogspot.com/2018/12/searching-statically-linked-vulnerable.html.
[24] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable Graph-based Bug Search for Firmware Images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 480–491, Vienna Austria, October 2016. ACM.
[37] Nathaniel Lageman, Eric D. Kilmer, Robert J. Walls, and Patrick D. McDaniel. BinDNN: Resilient function matching using deep learning. In International Conference on Security and Privacy in Communication Systems, pages 517–537. Springer, 2016.
[40] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning, pages 3835–3845. PMLR, 2019.
[44] Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou. αDiff: Cross-version binary code similarity detection with DNN. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 667–678, 2018.
[45] Luca Massarelli, Giuseppe A. Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis. In Proceedings 2019 Workshop on Binary Analysis Research, San Diego, CA, 2019. Internet Society.
[47] Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. How Machine Learning Is Solving the Binary Function Similarity Problem — Artifacts and Additional Technical Details. https://github.com/Cisco-Talos/binary_function_similarity.
[49] Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. Safe: Self-attentive function embeddings for binary similarity. In Proceedings of Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), 2019.
[52] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[53] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26:3111–3119, 2013.
[57] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[60] Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. Trex: Learning execution semantics from micro-traces for binary similarity. arXiv preprint arXiv:2012.08680, 2020.
[64] Kimberly Redmond, Lannan Luo, and Qiang Zeng. A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis. In NDSS Workshop on Binary Analysis Research (BAR), 2019.
[65] Radim Rehurek and Petr Sojka. Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2), 2011.
[67] Noam Shalev and Nimrod Partush. Binary Similarity Detection Using Machine Learning. In Proceedings of the 13th Workshop on Programming Languages and Analysis for Security - PLAS ’18, pages 42–47, Toronto, Canada, 2018. ACM Press.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[76] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, page 363–376, New York, NY, USA, 2017. Association for Computing Machinery.
[78] Zeping Yu, Rui Cao, Qiyi Tang, Sen Nie, Junzhou Huang, and Shi Wu. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):1145–1152, April 2020.
[79] Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, and Shi Wu. Codecmr: Cross-modal retrieval for function-level binary source code matching. Advances in Neural Information Processing Systems, 33, 2020.

Insights

(1) A unified evaluation of 10 representative methods; code and dataset available at https://github.com/Cisco-Talos/binary_function_similarity
(2) Using basic-block features (e.g., ACFG) can provide better results, but there is little difference between carefully hand-designed features and simpler ones such as a bag-of-words of basic-block opcodes.
