Top Academic Conference on Software Engineering - ESEC/FSE 2022 Topics (Network Security Direction) List, Abstract and Summary

Summarize

Cyber security-related topics in this conference cover different research fields such as blockchain, smart contracts, symbolic execution, and browser API fuzz testing.

Popular research directions:

1. Vulnerability detection and repair based on deep learning

2. AI-based automatic vulnerability repair

3. Fuzz testing and vulnerability discovery

Unpopular research directions:

1. Vulnerability analysis of multi-language code

2. Software security in code review

3. Browser API fuzz testing

Suggestions for future research directions:

1. End-to-end automatic vulnerability management combining program analysis and AI technology

2. AI-driven detection and repair of 0day vulnerabilities

3. Construction of secure machine learning system and defense against adversarial samples

1、An empirical study of blockchain system vulnerabilities: modules, types, and patterns

Blockchain, a distributed ledger technology, is growing in popularity, especially for supporting valuable cryptocurrencies and smart contracts. However, blockchain software systems inevitably have many bugs. Although bugs in smart contracts have been extensively studied, security bugs in the underlying blockchain system have been less explored. In this paper, we conduct an empirical study of blockchain system vulnerabilities on four representative blockchain systems (Bitcoin, Ethereum, Monero, and Stellar). Specifically, we first designed a systematic filtering process to effectively identify 1,037 vulnerabilities and their 2,317 patches from 34,245 issues/PR (pull requests) and 85,164 commits on GitHub. Therefore, we built the first blockchain vulnerability dataset, available at [https://github.com/VPRLab/BlkVulnDataset ↗](https://github.com/VPRLab/BlkVulnDataset). We then performed a unique analysis on this dataset at three levels, including (i) classifying vulnerable modules at the file level by identifying and correlating module paths between projects, (ii) through natural language processing and similarity-based Degree of sentence clustering clusters text-level vulnerability types, and (iii) code-level vulnerability patterns are analyzed by generating and clustering code change signatures that capture syntactic and semantic information of patched code fragments.

Our analysis revealed three key findings: (i) some blockchain modules are more vulnerable than others; in particular, each module related to consensus, wallets, and network had more than 200 issues; (ii) approx. 70% of blockchain vulnerabilities are of traditional types, but we also identified four new blockchain-specific types; (iii) we obtained 21 unique blockchain-specific vulnerability patterns, capturing unique blockchain properties and states, and demonstrate that they can be used to detect similar vulnerabilities in other popular blockchains such as Dogecoin, Bitcoin SV, and Zcash.

Paper link: https://doi.org/10.1145/3540250.3549105

2、Automated generation of test oracles for RESTful APIs

In recent years, there has been a proliferation of tools for generating test cases for RESTful APIs. However, despite their promising results, they all face the same limitation: they can only detect crashes (i.e. server errors) and non-compliance with API specifications. In this paper, we propose a technique to automatically generate RESTful API test cases by detecting invariants. In practice, our approach aims to learn the expected properties of the output by analyzing previous API requests and their responses. To this end, we extend the popular tool Daikon for dynamic detection of possible invariants. In a preliminary evaluation of 8 operations on 6 industrial APIs, the overall accuracy reached 66.5% (reaching 100% in 2 operations). Additionally, our approach revealed 6 reproducible bugs in APIs with millions of users: Amadeus, GitHub, and OMDb.

Paper link: https://doi.org/10.1145/3540250.3559080

3、Automated unearthing of dangerous issue reports

The Coordinated Vulnerability Disclosure (CVD) process, commonly used for open source software (OSS) vulnerability management, recommends reporting discovered vulnerabilities privately and keeping related information confidential until formal disclosure. However, in practice, due to various reasons (e.g., lack of security domain expertise or security management awareness), many vulnerabilities are first reported through public issue reports (IRs) before being formally disclosed. These IRs are dangerous IRs because attackers can exploit the leaked vulnerability information to launch zero-day attacks. It is crucial to identify such dangerous IRs early so that OSS users can start the vulnerability remediation process earlier and OSS maintainers can manage dangerous IRs in a timely manner. In this paper, we propose and evaluate a deep learning-based method, named MemVul, for automatically identifying dangerous IRs when they are reported. MemVul enhances neural networks by adding a memory component that stores knowledge of external vulnerabilities from the Common Weakness Enumeration (CWE). We rely on publicly accessible CVE Reference IRs (CIRs) to implement the concept of dangerous IRs. We mined 3,937 CIRs distributed among 1,390 OSS projects hosted on GitHub. Evaluated in real-life scenarios with high data imbalance, MemVul achieves the best trade-off between precision and recall. In particular, MemVul’s F1 score (i.e., 0.49) improves by 44% over the best baseline. For IRs that were predicted to be CIRs but not reported to CVEs, we conducted user studies to investigate their usefulness to OSS stakeholders. We observed that 82% (41 out of 50) of these IRs were security-related, with 28 security experts recommending public disclosure, demonstrating MemVul's ability to identify undisclosed dangerous IRs.

Paper link: https://doi.org/10.1145/3540250.3549156

4、Avgust: automating usage-based test generation from videos of app executions

Writing and maintaining UI tests for mobile applications is a time-consuming and tedious task. Although decades of research have produced automated methods for automatically generating UI tests, these methods typically focus on testing for crashes or maximizing code coverage. In contrast, recent research shows that developers prefer usage-based testing that focuses on specific application feature usage to help support activities such as regression testing. Few existing technologies support generating such tests, as this requires automating the difficult task of understanding the semantics of the UI interface and user input. In this article, we introduce a tool called Avgust that automates the key steps of generating usage-based tests. Avgust uses neural models for image understanding, processing video recordings used by applications, to synthesize application-agnostic state machine encodings. Avgust then uses this encoding to synthesize test cases for the new target application. We evaluated Avgust on 374 videos of common usage scenarios for 18 popular applications and showed that 69% of the test cases were able to successfully perform the desired usage patterns and that Avgust's classifier outperformed the state of the art.

Paper link: https://doi.org/10.1145/3540250.3549134

5、Blackbox adversarial attacks and explanations for automatic speech recognition

Automatic speech recognition (ASR) models are widely used in applications such as voice navigation and voice control of home appliances. The computational core of ASR is a deep neural network (DNN), which has been shown to be susceptible to adversarial perturbations and suffers from undesirable biases and ethical issues. To evaluate the security of ASR, we propose some techniques for generating black-box (DNN-agnostic) adversarial attacks that are portable to various ASRs. This is a different approach compared to existing work that focuses on white-box attacks that are time-consuming and lack portability. Furthermore, in order to figure out why ASR (always a black box) is vulnerable, we provide some explanation methods about ASR to help increase our understanding of the system and ultimately build trust in the system.

Paper link: https://doi.org/10.1145/3540250.3558906

6、CLIFuzzer: mining grammars for command-line invocations

The behavior of command-line utilities can be affected by passing command-line options and parameters (configuration settings) that enable, disable, or otherwise affect the portions of code to be executed. Therefore, system testing of command line utilities requires testing with different configurations of the various supported command line options.

We introduce CLIFuzzer, a tool that takes an executable program and uses dynamic analysis to trace input processing, automatically extracting its complete set of options, arguments, and argument types. This set forms a grammar that represents valid options and valid sequences of arguments. By generating calls from this syntax, we can fuzz the program using an infinite list of random configurations, covering the relevant code. This will lead to higher coverage and new bug discoveries than a pure mutational fuzzer.

Paper link: https://doi.org/10.1145/3540250.3558918

7、CodeMatcher: a tool for large-scale code search based on query semantics matching

Thanks to the emergence of large-scale code repositories such as GitHub and Gitee, searching and reusing existing code can help developers significantly increase software development productivity. Over the years, many code search tools have been developed. Early tools utilized information retrieval (IR) techniques to perform efficient code searches in large and frequently changing code bases. However, the search accuracy is lower due to the semantic mismatch between query and code. In recent years, many tools have utilized deep learning (DL) technology to solve this problem. However, DL-based tools are slow and the search accuracy is unstable.

In this paper, we introduce CodeMatcher, an IR-based tool that inherits the advantages of DL-based tools in query semantic matching. Typically, CodeMatcher indexes large-scale code bases first to speed up search response times. For a given search query, it processes irrelevant and noisy words in the query, then retrieves candidate codes from the indexed code base through an iterative fuzzy search, and finally rearranges the candidate codes based on two design metrics between the query and the candidate codes. . We implement CodeMatcher as a search engine website. To validate the effectiveness of our tool, we evaluated CodeMatcher on over 41,000 open source Java code bases. Experimental results show that CodeMatcher can achieve industrial-grade response time (0.3 seconds) on an ordinary server equipped with Intel-i7 CPU. In terms of search accuracy, CodeMatcher significantly outperforms three state-of-the-art tools (DeepCS, UNIF, and CodeHow) and two online search engines (GitHub Search and Google Search).

Paper link: https://doi.org/10.1145/3540250.3558935

8、DeJITLeak: eliminating JIT-induced timing side-channel leaks

The timing side channel can be exploited to infer confidential information when the execution time of the program is related to the confidential information. Recent research shows that just-in-time compilation (JIT) can introduce new timing side-channels even when time-balanced at the source code level. In this paper, we propose a novel approach to eliminate JIT-induced leaks. We first formalize the timing side-channel security under JIT compilation through the concept of time balancing, laying a foundation for reasoning about programs with JIT compilation. Then, we propose to eliminate JIT-induced leaks through fine-grained JIT compilation. To this end, we provide a method to automatically generate compilation policies and a novel type system to ensure their sanity. We developed a tool called DeJITLeak that works with actual Java code and implements fine-grained JIT compilation in the HotSpot JVM. Experimental results show that DeJITLeak can effectively and efficiently eliminate JIT-induced leakage on three widely adopted benchmarks in a side-channel detection setting.

Paper link: https://doi.org/10.1145/3540250.3549150

9、Demystifying the underground ecosystem of account registration bots

Member services are a core part of most online systems. For example, in online social networks and video platforms, membership services enable users to be provided with customized content or track their footsteps for recommendations. However, there is a dark side behind membership services, including influencer marketing, coupon collection and spreading false news. All of these activities rely heavily on having a large number of fake accounts, and to efficiently create new accounts, malicious registrants use automated registration bots and anti-human verification services that can easily bypass website security policies. In this article, we take the first step towards understanding account registration bots and the anti-human verification services they use. Through a comprehensive analysis, we identified the three most popular anti-human verification services. We then conducted experiments on these services from an attacker's perspective to verify their effectiveness. The results show that all of these services are able to easily bypass the security strategies adopted by website providers to prevent fake registrations, such as SMS verification, verification codes and IP monitoring. We further estimate the market size of the underground registration ecosystem at approximately $4.8 million to $128.1 million per year. Our research shows that we need to urgently think about the effectiveness of our registration security policies and prompts us to develop new strategies to better protect our systems.

Paper link: https://doi.org/10.1145/3540250.3549090

10、Diet code is healthy: simplifying programs for pre-trained models of code

Pretrained code representation models (such as CodeBERT) have demonstrated excellent performance in a variety of software engineering tasks, however they tend to be heavy in complexity, quadratic with the length of the input sequence. We conduct an empirical analysis of CodeBERT's attention and find that CodeBERT pays more attention to certain types of tags and statements, such as keywords and data-related statements. Based on these findings, we propose DietCode, which aims to leverage source code for lightweight, large-scale pre-trained models. DietCode simplifies the input procedure of CodeBERT through three strategies, namely word discarding, frequency filtering and attention-based strategy, which selects statements and tags that receive the most attention weight during pre-training. Therefore, it significantly reduces computational costs without affecting model performance. Experimental results on two downstream tasks show that DietCode is 40% less computationally expensive than CodeBERT in fine-tuning and testing and provides comparable results.

Paper link: https://doi.org/10.1145/3540250.3549094

11、FastKLEE: faster symbolic execution via reducing redundant bound checking of type-safe pointers

Symbolic execution (SE) has been widely used in automated program analysis and software testing. Many SE engines (such as KLEE or Angr) require some intermediate representation (IR) of the code to be interpreted during execution, which can be slow and expensive. Although many studies have proposed methods to accelerate SE, few of them consider optimizing the internal interpretation operations. In this paper, we propose FastKLEE, a faster SE engine designed to speed up execution by reducing redundant bounds checking of type-safe pointers during IR code interpretation. Specifically, in FastKLEE, a type inference system is first utilized to classify the pointer types of the most commonly interpreted read/write instructions (i.e., safe or unsafe). Then, a customized memory operation is designed that only performs bounds checking on unsafe pointers and omits redundant checks on safe pointers. We implemented FastKLEE on top of the well-known SE engine KLEE and combined it with the well-known type inference system CCured. Evaluation results show that FastKLEE is able to reduce the time by up to 9.1% (5.6% on average) relative to the state-of-the-art method KLEE that also explores the same number (i.e. 10k) execution paths. The open source code of FastKLEE can be found at [https://github.com/haoxintu/FastKLEE↗](https://github.com/haoxintu/FastKLEE). A video demonstration of FastKLEE can be viewed at [https://youtu.be/fjV_a3kt-mo↗](https://youtu.be/fjV_a3kt-mo).

Paper link: https://doi.org/10.1145/3540250.3558919

12、Fault localization to detect co-change fixing locations

Fault localization (FL) is the leading step in most automated program repair (APR) methods and repairs problematic statements identified by FL tools. We propose FixLocator, a deep learning (DL)-based fault localization method that supports detecting problematic statements in one or more methods in the same fix and modifying them accordingly. We call them the co-modified (CC) locations of faults. We regard this FL problem as dual-task learning, using two models. Method-level FL models MethFL learns methods that need to be fixed together. The statement-level FL model StmtFL learns statements that require joint repair. Correct learning of one model can benefit another model and vice versa. Therefore, we train them simultaneously by cross-connecting units to soft-share model parameters to achieve influence propagation between MethFL and StmtFL. Additionally, we explore a new feature for FL: co-modified statements. We also use graph-based convolutional networks to integrate different types of program dependencies.

Our empirical results show that FixLocator locates 26.5% - 155.6% more co-modified statements relative to state-of-the-art statement-level FL baseline models. To evaluate its usefulness in APR, we combine FixLocator with state-of-the-art APR tools. The results show that FixLocator+DEAR (the original FL in DEAR is replaced by FixLocator) and FixLocator+CURE improve the number of errors fixed by 10.5% and 42.9% respectively compared with the original DEAR and Ochiai+CURE.

Paper link: https://doi.org/10.1145/3540250.3549137

13、Fuzzing deep-learning libraries via automated relational API inference

In recent years, deep learning (DL) has attracted widespread attention. At the same time, errors in DL systems may lead to serious consequences and may even threaten human life. Therefore, more and more research is devoted to DL model testing. However, there is still limited testing efforts for DL libraries such as PyTorch and TensorFlow, which are the basis for building, training, and running DL models. Previous work on DL library fuzz testing can only generate tests for APIs that have been called by documentation examples, developer tests, or DL models, leaving a large number of untested APIs. In this paper, we propose DeepREL, the first method for automatically inferring relational APIs for more efficient fuzzing of DL libraries. Our basic assumption is that in the DL library under test, there may be many APIs that share similar input parameters and outputs; in this way, we can easily "borrow" test inputs from the called APIs to test other relational types API. Additionally, we formalize the concepts of value equivalence and state equivalence for relational APIs as expected consequences of efficient error discovery. We have implemented DeepREL as a fully automated end-to-end relational API inference and DL library fuzz testing technology, which can: 1) automatically infer potential API relationships based on the syntax/semantic information of the API; 2) synthesize for calling relational Specific testing procedures for the API; 3) validation of the inferred relational API with representative test inputs; and finally 4) fuzz testing of the validated relational API to uncover potential inconsistencies. We evaluate two of the most popular DL libraries, PyTorch and TensorFlow, and the results show that DeepREL can cover 157% more APIs than the state-of-the-art FreeFuzz. To date, DeepREL has detected a total of 162 bugs, 106 of which have been confirmed by developers as previously unknown bugs. Astonishingly, DeepREL detected 13.5% of high-priority bugs in the entire PyTorch issue tracking system over a three-month period. Additionally, in addition to these 162 coding errors, we found 14 documentation errors (all confirmed).

Paper link: https://doi.org/10.1145/3540250.3549085

14、Generating realistic vulnerabilities via neural code editing: an empirical study

The availability of large-scale, realistic vulnerability data sets is critical to evaluate existing technologies and develop effective data-driven approaches to software security. However, such datasets are severely lacking. One promising solution is to generate such datasets by injecting vulnerabilities into real-world programs, which are abundantly available. Therefore, in this paper, we explore the feasibility of injecting vulnerabilities through neural code editing. Using synthetic and real datasets, we investigate the potential and gaps of three state-of-the-art neural code editors in injecting vulnerabilities. We found that the editors in these studies had key limitations, with the best accuracy being only 10.03% on real datasets, compared to 79.40% on synthetic datasets. Although graph-based editors are more efficient than sequence-based editors (successfully injecting vulnerabilities in up to 34.93% of real-world test samples), they are still limited by complex code structures and long editing times, due to their inefficiency in preprocessing and The design of deep learning (DL) models is insufficient. We reveal the potential of neural code editing to generate realistic vulnerable samples as they can improve the F1 score of DL-based vulnerability detectors by up to 49.51%. We also provide insights into the gaps in current editors (e.g., they are good at deleting but not good at replacing code) and actionable recommendations for addressing these gaps (e.g., designing effective editing primitives).

Paper link: https://doi.org/10.1145/3540250.3549128

15、Group-based corpus scheduling for parallel fuzzing

Parallel fuzz testing relies on hardware resources to ensure test throughput and efficiency. In industrial practice, it is well known that parallel fuzz testing faces the challenge of task partitioning, but most studies ignore the important corpus allocation process. In this paper, we propose a group-based corpus scheduling strategy to solve these two problems, which has been accepted by the LLVM community. We implemented a parallel fuzz testing tool called glibFuzzer based on this strategy. glibFuzzer first divides the global corpus into different subsets and then assigns different energy scores and difference scores to them. The energy score is mainly determined by the seed size and the length of the coverage information, while the difference score can describe the degree of difference in code covered by different subsets of seeds. In each round of critical local corpus construction, the master node selects high-quality seeds by combining these two scores to improve testing efficiency and avoid task conflicts. To demonstrate the effectiveness of this strategy, we conduct extensive evaluations on real-world programs and FuzzBench. After 4 × 24 CPU hours of testing, glibFuzzer covered 22.02% more branches than libFuzzer in 18 real-world programs and executed 19.42 times more test cases. Compared with AFL, PAFL and UniFuzz, the average branch coverage of glibFuzzer increased by 73.02%, 55.02% and 55.86% respectively. What's more, glibFuzzer discovered more than 100 unique vulnerabilities.

Paper link: https://doi.org/10.1145/3540250.3560885

16、How to better utilize code graphs in semantic code search?

Semantic code search greatly facilitates software reuse, enabling users to find snippets of code that closely match user-specified natural language queries. Due to the rich expressive power of code graphs (such as control flow graphs and program dependency graphs), both mainstream research efforts (i.e., multimodal models and pre-trained models) attempt to incorporate code graphs into code modeling. However, they still have some limitations: First, there is still a lot of room for improvement in search performance. Second, they have not fully considered the unique characteristics of code graphs. In this paper, we propose a graph-to-sequence converter called G2SC. By converting the code graph into a lossless sequence, G2SC is able to solve the small graph learning problem through sequence feature learning and capture the edge and node attribute information of the code graph. Thus, the effect of code search can be greatly improved. In particular, G2SC first transforms the code graph into unique corresponding node sequences through a specific graph traversal strategy. Then, a sequence of statements is obtained by replacing each node with the corresponding statement. A carefully designed set of graph traversal strategies ensures that the process is one-to-one and reversible. G2SC is able to capture rich semantic relationships (such as control flow, data flow, node/relationship attributes) and provide data transformation suitable for learning models. It can be flexibly integrated with existing models to better utilize code graphs. As a proof-of-concept application, we propose two G2SC-enabled models: GSMM (G2SC-enabled multimodal model) and GSCodeBERT (G2SC-enabled CodeBERT model). Extensive experimental results on two real-world large-scale datasets show that GSMM and GSCodeBERT improve the performance of the previous state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, respectively, and improve MRR by 92% and 22% respectively. 63% and 11.5%.

Paper link: https://doi.org/10.1145/3540250.3549087

17、Input splitting for cloud-based static application security testing platforms

As software development teams adopt DevSecOps practices, application security increasingly becomes the responsibility of development teams, who need to set up their own static application security testing (SAST) infrastructure. Since development teams often do not have the necessary infrastructure and expertise to set up custom SAST solutions, there is a growing need for cloud-based SAST platforms that run various static analyzers as a service. Adding new static analyzers to a cloud-based SAST platform can be challenging because static analyzers vary widely in complexity, from efficiently scaling code inspection tools to inter-procedural data using cubic or more complex algorithms. Streaming engine. Careful manual evaluation is required to determine whether the new analyzer will slow down the platform's overall response time or whether it will time out frequently. We explored whether this problem could be simplified by splitting the input to the analyzer into multiple partitions and analyzing these partitions independently. Depending on the complexity of the static analyzer, the partition size can be adjusted to improve overall response time. We report an experiment in which we ran it with and without segmentation input using different analysis tools. Experimental results show that a simple partitioning strategy can effectively reduce the run time and memory usage of each partition without significantly affecting the tool's results.

Paper link: https://doi.org/10.1145/3540250.3558944

18、KVS: a tool for knowledge-driven vulnerability searching

We extract and organize the scattered vulnerability information in the existing vulnerability management library to make it easier to quickly locate and search for specific vulnerabilities and their solutions. To solve this problem, we extract knowledge from vulnerability reports and organize vulnerability information into the form of a knowledge graph. Then, we implemented a knowledge-driven vulnerability search tool, namely KVS. This tool mainly uses the BERT model to identify vulnerability named entities and build a vulnerability knowledge graph (VulKG). Finally, we can search for vulnerabilities of interest based on VulKG. The URL of the tool is [https://cinnqi.github.io/Neo4j-D3-VKG/ ↗](https://cinnqi.github.io/Neo4j-D3-VKG/). The video of our demonstration can be viewed at [https://youtu.be/FT1BaLUGPk0↗](https://youtu.be/FT1BaLUGPk0).

Paper link: https://doi.org/10.1145/3540250.3558920

19、MANDO-GURU: vulnerability detection for smart contract source code by heterogeneous graph embeddings

Smart contracts are increasingly used in high-value applications in blockchain systems. It is important to ensure the quality of the smart contract source code before deployment. This paper proposes a new deep learning-based tool, named MANDO-GURU, designed to accurately detect smart contract vulnerabilities at the coarse-grained contract level and fine-grained row level. By using a combination of control flow graphs and call graphs of Solidity code, we design new heterogeneous graph attention neural networks to encode more structural and latent semantic relationships between different types of nodes and edges in the graph, and use Encoded embedding of graphs and nodes to detect vulnerabilities. Validating on a real-world smart contract dataset, we find that MANDO-GURU can significantly improve many other vulnerability detection techniques in terms of contract-level F1 scores, with improvements up to 24% depending on the vulnerability type. It is the first learning-based Ethereum smart contract tool that can identify vulnerabilities at the line level and improves on traditional code analysis-based techniques by up to 63.4%. Our tools are publicly available at [https://github.com/MANDO-Project/ge-sc-machine↗](https://github.com/MANDO-Project/ge-sc-machine is publicly available). A beta version is currently deployed at [http://mandoguru.com ↗](http://mandoguru.com), and a demo video of our tool can be viewed at [http://mandoguru.com/demo-video↗] (Watch at http://mandoguru.com/demo-video).

Paper link: https://doi.org/10.1145/3540250.3558927

20、MOSAT: finding safety violations of autonomous driving systems using multi-objective genetic algorithm

Automated driving systems (ADS) are safety-critical systems, and safety violations of autonomous vehicles (AVs) in actual traffic will cause huge losses. Therefore, adequate testing must be performed before ADS is deployed on real roads. Simulation testing is very important for discovering security violations of ADS. This paper proposes a multi-objective search-based testing framework called MOSAT, which constructs diverse and adversarial driving environments to reveal safety violations of ADS. Specifically, based on atomic driving actions, MOSAT introduces pattern patterns, which describe a series of driving action sequences that can effectively challenge ADS. MOSAT constructs test scenarios through atomic actions and pattern patterns, and uses a multi-objective genetic algorithm to search for adversarial and diverse test scenarios. In addition, in order to comprehensively test the performance of ADS during long-mileage driving, we designed a novel continuous simulation testing technology by running multiple scenarios generated in the simulator by parallel search processes at the same time, and continuously performing different perturbations on the ADS . We demonstrated MOSAT on the Baidu Apollo industrial-grade platform and experimentally proved that MOSAT can effectively generate security-critical scenarios that cause ADS to collapse and expose 11 different types of security violations in a short period of time. It also surpassed the performance of existing technologies by detecting 6 more different safety violations on the same road.

Paper link: https://doi.org/10.1145/3540250.3549100

21、Minerva: browser API fuzzing with dynamic mod-ref analysis

Browser APIs are essential to the modern web experience. Due to their large number and complexity, they greatly expand the browser's attack surface. To detect vulnerabilities in these APIs, fuzzers generate a large number of test cases for random API calls. However, the large search space formed by arbitrary API combinations hinders their effectiveness: since randomly selected API calls are unlikely to interfere with each other (i.e., computed on partially shared data), interesting API interactions are rarely explored. Therefore, reducing the search space by revealing the relationships between APIs is a significant challenge in browser fuzzing. We propose Minerva, an efficient fuzzing tool for browser API vulnerability detection. The key idea is to leverage API interference relationships to reduce redundancy and improve coverage. Minerva includes two modules: dynamic modification reference analysis and guided code generation. Before fuzz testing begins, the dynamically modified reference analysis module builds an API interference graph. It starts by automatically identifying individual browser APIs from the browser's code base. Next, it instrumented the browser to dynamically collect modified reference relationships between APIs. During the fuzz testing process, the guided code generation module synthesizes highly relevant API calls based on modified reference relationships. We evaluated Minerva on three major browsers: Safari, FireFox, and Chromium. Compared to state-of-the-art fuzz testing tools, Minerva improves boundary coverage by 19.63% to 229.62% and finds 2x to 3x more unique vulnerabilities. In addition, Minerva discovered 35 previously unknown vulnerabilities, 20 of which have been fixed, assigned 5 CVE numbers, and recognized by the browser vendors.

Paper link: https://doi.org/10.1145/3540250.3549107

22、On the vulnerability proneness of multilingual code

Using multiple languages to build software has long been the norm, yet whether multilingual code building has important security implications and actual security consequences remains unclear. This paper aims to answer this question by conducting a large-scale study of popular multilingual projects on GitHub and their evolutionary history. We achieve this goal using a novel multilingual code characterization technique. We found a statistically significant correlation between multilingual code's vulnerability to vulnerabilities (both general vulnerabilities and specific classes of vulnerabilities) and its language choice. We also found that this association was related to the association of language interface mechanisms rather than to the association of individual languages. We validate our statistical results with in-depth real-world vulnerability case studies, explained through mechanism and language choices. Our findings call for immediate action to assess and defend against multilingual vulnerabilities and provide practical recommendations.

Paper link: https://doi.org/10.1145/3540250.3549173

23、RoboFuzz: fuzzing robotic systems over robot operating system (ROS) for finding correctness bugs

Robotic systems are becoming an integral part of human life. In response to the increased demand for robot production, Robot Operating System (ROS), as an open source middleware suite, is receiving increasing attention by providing practical tools and libraries for rapid robot development. In this paper, we focus on a relatively rarely tested class of errors in ROS and ROS-based robotic systems, called semantic correctness errors, which include specification violations, physical law violations, and cyber-physical inconsistencies. These bugs often originate from the cyber-physical nature of robotic systems, where noisy hardware components are intertwined with software components, and therefore cannot be detected by existing fuzzing methods that primarily focus on finding memory-safe bugs.

We propose RoboFuzz, a feedback-driven fuzzing framework integrated with ROS and capable of testing correctness bugs. RoboFuzz has the following features: (1) Data type-aware mutation, effectively stress testing data-driven ROS systems; (2) Hybrid execution, obtaining robot states from the real world and simulator, capturing unanticipated cyber physics Inconsistency; (3) a predefined correctness rule judger that identifies correctness errors by checking the execution status and comparing it with the predefined correctness rules; (4) a semantic feedback engine to provide enhanced guidance for the input mutator , supplementing traditional feedback based on code coverage. For distributed data-driven robots, traditional feedback methods are not effective enough. By encoding the correctness invariants of ROS and four ROS-compatible robotic systems into specialized discriminators, RoboFuzz detected 30 previously unknown bugs, 25 of which have since been identified and six fixed.

Paper link: https://doi.org/10.1145/3540250.3549164

24、SEDiff: scope-aware differential fuzzing to test internal function models in symbolic execution

Symbolic execution has become a fundamental program analysis technique. When performing symbolic execution, you will inevitably encounter internal functions (such as library functions) that provide basic operations (such as string processing).

Many symbolic execution engines build internal function models to abstract function behavior for the sake of scalability and compatibility. Due to the high complexity of building models, developers intentionally abstract only part of the functional behavior, i.e. the modeled functionality.

The correctness of the internal function model is critical as it will affect all applications of symbolic execution, such as vulnerability detection and model checking.

A simple solution for correctness testing of internal function models is to cross-check that the model behaves as expected from its corresponding original function implementation. However, such solutions often detect overwhelming inconsistencies related to unmodeled features, which are beyond the scope of the model and therefore considered false reports.

We believe that a sound testing approach should only target the functionality that the developer intends to model. However, automatically identifying the scope of the functionality being modeled is a significant challenge.

In this paper, we propose a range-oriented difference testing framework SEDiff to solve this problem. We design a novel algorithm that automatically maps modeled functionality to code in the original implementation. SEDiff then applies range-oriented gray-box diff fuzzing to the relevant code in the original implementation. It also comes with a novel range-oriented input generator and a custom error checker that efficiently and correctly detects erroneous inconsistencies. We conducted an extensive evaluation of several popular real-world symbolic execution engines for binary, web, and kernel. Our manual investigation showed that SEDiff accurately identified modeled functions and detected 46 new vulnerabilities in the internal function model used in the symbolic execution engine.

Paper link: https://doi.org/10.1145/3540250.3549080

25、Security code smells in apps: are we getting better?

Users increasingly rely on mobile applications for daily tasks, including tasks involving security and privacy such as online banking, e-health, and e-government. Additionally, numerous sensors capture users’ movements and habits for fitness tracking and convenience. Although laws and regulations impose requirements and restrictions on the processing of privacy-sensitive data, users must still trust application developers to provide adequate protection. In this paper, we investigate the security status of Android applications and the development of security-related code smells since the introduction of the Android operating system.

By analyzing 300 apps from the Google Play Store for each year between 2010 and 2021, we found that the number of code scanner discoveries per thousand lines of code decreased over time. However, this development trend is offset by the increase in code size. Applications are increasingly being discovered, indicating a reduced overall security level. This trend stems from flawed use of encryption, unsafe compiler flags, unsafe use of WebView components, and unsafe use of language features such as reflection. Based on our data, we advocate for tighter controls on apps before they hit the store.

Paper link: https://doi.org/10.1145/3540250.3549091

26、Software security during modern code review: the developer’s perspective

To avoid software vulnerabilities, organizations are moving security to earlier stages of software development, such as during code reviews. In this article, we aim to understand developers’ perspectives when assessing software security during code reviews, the challenges they encounter, and the support provided by companies and projects. To do this, we conducted a two-step survey: We interviewed 10 professional developers and surveyed 182 practitioners about software security assessment during code reviews. The findings are an overview of developers' perceptions of software security during code reviews and a set of identified challenges. Our research found that most developers do not immediately report concerns about security issues during code reviews. Only after being asked about software security did developers say they always considered security during the review process and recognized its importance. Most companies don't provide security training but expect developers to still ensure security during reviews. As a result, developers reported lack of training and security knowledge as the main challenges they face when checking for security issues. Additionally, they face challenges in handling third-party libraries and identifying interactions between parts of the code that may have security implications. Additionally, security can be overlooked during the review process due to developers' assumptions about the security dynamics of the applications they develop.

Preprint: https://arxiv.org/abs/2208.04261

Data and materials: https://doi.org/10.5281/zenodo.6969369

Paper link: https://doi.org/10.1145/3540250.3549135

27、SolSEE: a source-level symbolic execution engine for solidity

Most existing smart contract symbolic execution tools perform analysis on bytecode, which results in the loss of high-level semantic information in the source code. This makes interactive analysis tasks such as visualization and debugging very challenging and significantly limits the usability of the tool. In this article, we introduce SolSEE, a source-level symbolic execution engine for Solidity smart contracts. We describe the design of SolSEE, highlight its key features, and demonstrate its usage via a web-based user interface. SolSEE has advantages over other existing source code level analysis tools in terms of the advanced Solidity language features it supports and its analysis flexibility. A demo video can be viewed at the following link: https://sites.google.com/view/solsee/.

Paper link: https://doi.org/10.1145/3540250.3558923

28、Tracking patches for open source software vulnerabilities

Open source software (OSS) vulnerabilities threaten the security of software systems using OSS. The vulnerability database provides valuable information on mitigating OSS vulnerabilities (e.g., vulnerable versions and patches). There is growing concern about the information quality of vulnerability databases. However, the quality of patches in existing vulnerability databases is unclear; and existing patch tracking methods based on manual or heuristic methods are either too expensive or too specific to be applied to all OSS vulnerabilities.

To address these issues, we first conduct an empirical study to understand the quality and characteristics of OSS vulnerability patches in two industrial vulnerability databases. Inspired by our research, we propose Tracer, the first automated method for tracking patches for OSS vulnerabilities from multiple knowledge sources. Our evaluation results show that: i) compared to heuristic-based methods, Tracer can track up to 273.8% of vulnerability patches and achieve a higher score on F1 score, up to 116.8%; ii) Tracer can complement industrial vulnerability databases . Our evaluation also demonstrates Tracer's generalizability and practicality.

Paper link: https://doi.org/10.1145/3540250.3549125

29、VulCurator: a vulnerability-fixing commit detector

The open source software (OSS) vulnerability management process is very important in today's era as the number of OSS vulnerabilities discovered increases over time. Monitoring vulnerability fix submissions is part of the standard process to prevent vulnerability exploitation. However, manually detecting vulnerability fix submissions is time-consuming due to the potentially large number of submissions that need review. Recently, many techniques have been proposed to use machine learning to automatically detect vulnerability remediation submissions. These solutions either do not use deep learning or only use limited information sources for deep learning. This paper presents VulCurator, a deep learning tool that leverages richer information sources, including commit messages, code changes, and issue reports, for bug fix commit classification. Our experimental results show that VulCurator outperforms state-of-the-art baselines in terms of F1 score by up to 16.1%.

The VulCurator tool is publicly available at: https://github.com/ntgiang71096/VFDetector and https://zenodo.org/record/7034132#.Yw3MN-xBzDI, with a demo video: https://youtu. be/uMlFmWSJYOE.

Paper link: https://doi.org/10.1145/3540250.3558936

30、VulRepair: a T5-based automated software vulnerability repair

As the number and complexity of software vulnerabilities increases, researchers have proposed various artificial intelligence (AI)-based methods to help security analysts with limited resources discover, detect, and locate vulnerabilities. However, security analysts still need to spend a lot of effort to manually fix these vulnerable functions. Recent research has proposed an automatic vulnerability repair method based on NMT, but it is still far from perfect due to various limitations. In this paper, we propose VulRepair, a T5-based automated software vulnerability repair method that leverages pre-training and BPE components to address various technical limitations in previous work. Through extensive experiments on 8,482 bug fixes from 1,754 real software projects, we find that our VulRepair achieves 44% perfect predictions and is 13%-21% more accurate than competing baseline methods. These results lead us to conclude that our VulRepair is more accurate than both baseline methods, highlighting significant progress in NMT-based automated vulnerability repair. Our additional investigation also showed that our VulRepair can accurately fix 1,706 real-world as many as 745 well-known vulnerabilities (e.g. Use After Free, incorrect input validation, operating system command injection), demonstrating the effectiveness of our VulRepair. Generates practicality and importance in vulnerability remediation to help security analysts with limited resources remediate vulnerabilities.

Paper link: https://doi.org/10.1145/3540250.3549098

31、You see what I want you to see: poisoning vulnerabilities in neural code search

Programming efficiency can be greatly improved in the search and reuse of open source software code snippets based on natural language queries. Recently, deep learning-based methods have become increasingly popular in the field of code search. Despite substantial progress in training accurate code search models, little attention has been paid to the robustness of these models.

This article aims to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep learning-based code search models? If so, can we detect contaminated data and remove these backdoors? This study conducts research and development on a series of backdoor attacks on code search models based on deep learning models through data pollution. We first demonstrate the vulnerability of existing models to backdoor attacks based on data contamination. We then introduce a simple yet effective attack method to attack neural code search models by polluting the corresponding training dataset.

Additionally, we demonstrate that the attack can affect the ranking of code search results by adding several specially crafted source code files to the training corpus. We show that this type of backdoor attack is effective against several representative deep learning-based code search systems and can successfully manipulate the ranked list of search results. Taking the bidirectional RNN-based code search system as an example, given a query containing an attack target word (such as "file"), the normalized ranking of the target candidates can be significantly improved from the top 50% to the top 4.43%. To defend against this attack, we empirically examine an existing common defense strategy and evaluate its performance. Our results show that the explored defense strategies do not yet work in backdoor attacks on our proposed code search system.

Paper link: https://doi.org/10.1145/3540250.3549153

Top Academic Conference on Software Engineering - ESEC/FSE 2022 Topics (Network Security Direction) List, Abstract and Summary

Guess you like