ISSTA 2023 (a top academic conference on software testing): paper list, abstracts, and summary for the network security direction


Summary

  1. This conference covers a wide range of security research topics, including source code analysis, binary code analysis, malware detection, vulnerability detection, fuzz testing, program verification, etc.

  2. Some popular research directions include: vulnerability detection based on machine learning, application of large language models in software security, and blockchain smart contract security analysis. These directions have continued to develop in recent years.

  3. Some of the less popular research directions include: embedded firmware security, autonomous driving system security testing, and multimedia content review software verification. The systems and application scenarios studied in these directions are relatively professional.

  4. Some directions worthy of in-depth research include: improving the automation of vulnerability repairs, enhancing the security of deep learning systems, and improving the interpretability of graph neural network models. These directions are all related to improving the reliability and explainability of automated software security technologies.


1、1dFuzz: Reproduce 1-Day Vulnerabilities with Directed Differential Fuzzing

1-day vulnerabilities are common in practice and pose a serious threat to end users, because attackers can learn and exploit them from released patches. Reproducing 1-day vulnerabilities is also important for defenders, for example to block attack traffic targeting them. A core question affecting the effectiveness of identifying and triggering 1-day vulnerabilities is what unique characteristics security patches have. Through a large-scale empirical study, we observe that a common and unique feature of patches is the tail call sequence (TCS), and in this paper we propose a new directed differential fuzzing solution, 1dFuzz, to efficiently reproduce 1-day vulnerabilities. Based on the TCS feature, we propose a locator, 1dLoc, that finds candidate patch locations through static analysis, a TCS-based distance metric for directed fuzzing, and a novel sanitizer, 1dSan, that captures PoCs for the 1-day vulnerability during fuzzing. We systematically evaluated 1dFuzz on a set of real-world software vulnerabilities under 11 different settings. The results show that 1dFuzz significantly outperforms state-of-the-art baselines and finds up to 2.26x more 1-day vulnerabilities in 43% less time.
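
To make the TCS idea concrete, here is a minimal sketch of how tail-call-sequence feedback could steer a directed fuzzer: each execution trace is scored by how much of the target TCS it reaches. The function names and scoring scheme are illustrative, not 1dFuzz's actual metric.

```python
def tcs_distance(call_trace, target_tcs):
    """Return 0 when the full TCS is reached; larger values mean the trace
    stopped further away from the patched location."""
    it = iter(call_trace)
    matched = 0
    for fn in target_tcs:                 # TCS functions must appear in order
        if any(call == fn for call in it):
            matched += 1
        else:
            break
    return len(target_tcs) - matched

def pick_seed(seeds, traces, target_tcs):
    """Prefer the seed whose trace gets closest to the target TCS."""
    return min(seeds, key=lambda s: tcs_distance(traces[s], target_tcs))

# Example: a trace reaching 2 of the 3 TCS functions has distance 1.
print(tcs_distance(["main", "parse", "decode"], ["parse", "decode", "fixup"]))
```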

Paper link: https://doi.org/10.1145/3597926.3598102


2、A Bayesian Framework for Automated Debugging

Debugging takes up most of a developer's time. Therefore, automated debugging techniques, including fault localization (FL) and automated program repair (APR), have attracted widespread attention for their potential to assist developers in debugging tasks. With recent advances in techniques that treat these two tasks as tightly coupled, such as unified debugging, a framework that formally expresses both tasks would improve our understanding of automated debugging and provide a way to formally analyze techniques and approaches. To this end, we propose a Bayesian framework for understanding automated debugging. We show that the Bayesian framework, together with an explicit statement of the goal of automated debugging, can recover maximal fault localization formulas from prior work and analyze existing APR techniques and their underlying assumptions.

To demonstrate our framework empirically, we further propose BAPP, a Bayesian patch prioritization technique that incorporates intermediate program values to analyze likely patch locations and repair operations; its core equations are derived from our Bayesian framework. We found that incorporating program values allows BAPP to identify correct patches more precisely: the rankings produced by BAPP reduce the number of patch evaluations by 68% and repair time by 34 minutes on average. Furthermore, our Bayesian framework suggests changes to the way fault localization information is used in program repair, which we verify are useful for BAPP. These results highlight the potential of value-aware automated debugging techniques and further validate our theoretical framework.
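
The Bayesian view can be pictured with a toy posterior update over candidate fault locations. The likelihood values, data structures, and function below are invented for illustration; they are not BAPP's actual equations.

```python
def posterior_fault_probs(prior, coverage, test_failed,
                          p_fail_if_faulty=0.7, p_fail_if_healthy=0.05):
    """prior: {loc: P(loc is faulty)}; coverage: {test: set of covered locs};
    test_failed: {test: True if the test fails}."""
    post = dict(prior)
    for test, failed in test_failed.items():
        for loc in post:
            covered = loc in coverage[test]
            p = p_fail_if_faulty if covered else p_fail_if_healthy
            post[loc] *= p if failed else (1.0 - p)
    total = sum(post.values()) or 1.0
    return {loc: v / total for loc, v in post.items()}

probs = posterior_fault_probs(
    prior={"foo.c:10": 0.5, "bar.c:20": 0.5},
    coverage={"t1": {"foo.c:10"}, "t2": {"foo.c:10", "bar.c:20"}},
    test_failed={"t1": True, "t2": False})
print(max(probs, key=probs.get))   # -> "foo.c:10", covered only by the failing test
```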

Paper link: https://doi.org/10.1145/3597926.3598103


3、API2Vec: Learning Representations of API Sequences for Malware Detection

Malware analysis based on API call sequences is an effective method because the sequences reflect the dynamic execution behavior of malware. Recent advances in deep learning allow these techniques to be applied to mine useful information from sequences of API calls. However, these methods mainly operate on raw sequences and may not effectively capture important information, especially for multi-process malware, mainly due to the issue of API call interleaving.

Motivated by this, this paper proposes API2Vec, a graph-based API embedding method for malware detection. First, we build a graph model to represent the raw sequences. In particular, we design the Temporal Process Graph (TPG) to model inter-process behavior and the Temporal API Graph (TAG) to model intra-process behavior. On these graphs, we design a heuristic random walk algorithm that generates paths capturing fine-grained malware behavior. By pre-training on these paths with the Doc2Vec model, we generate embeddings of paths and APIs for malware detection. Experiments on a real malware dataset demonstrate that API2Vec outperforms state-of-the-art embedding and detection methods in both accuracy and robustness, especially for multi-process malware.
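
A minimal sketch of the graph-and-walk idea: build a per-process temporal API graph from a call sequence, then sample random walks that could later be fed to a Doc2Vec-style embedder. The construction and walk parameters are simplified stand-ins for API2Vec's heuristic walk.

```python
import random
from collections import defaultdict

def build_api_graph(api_calls):
    """Edges follow the temporal order of API calls within one process."""
    graph = defaultdict(list)
    for a, b in zip(api_calls, api_calls[1:]):
        graph[a].append(b)
    return graph

def random_walks(graph, num_walks=10, walk_len=20, seed=0):
    """Sample paths over the API graph; each path is one 'document'."""
    rng = random.Random(seed)
    nodes = list(graph)
    walks = []
    for _ in range(num_walks):
        node = rng.choice(nodes)
        walk = [node]
        for _ in range(walk_len - 1):
            successors = graph.get(node)
            if not successors:
                break
            node = rng.choice(successors)
            walk.append(node)
        walks.append(walk)
    return walks

g = build_api_graph(["CreateFile", "WriteFile", "CloseHandle", "CreateFile"])
print(random_walks(g, num_walks=2, walk_len=5))
```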

Paper link: https://doi.org/10.1145/3597926.3598054


4、An Empirical Study on the Effects of Obfuscation on Static Machine Learning-Based Malicious JavaScript Detectors

With the increase in cyber attacks and the high cost of manual identification, machine learning is increasingly used for malicious JavaScript detection. In practice, both malicious and benign scripts are often obfuscated before being uploaded, to hide malicious behavior or to protect intellectual property. While obfuscation has its benefits, it also introduces additional code features (such as dead code). When a malicious JavaScript detector is trained with machine learning, these additional features can affect the model and make it less effective. However, there is currently no clear understanding of how robust existing machine-learning-based detectors are against different obfuscators.

In this paper, we conduct the first empirical study to clarify the impact of obfuscation on static feature-based machine learning detectors. From the experimental results we make several observations: 1) Obfuscation significantly degrades detector effectiveness, increasing both the false positive and false negative rates, and obfuscation bias in the training set makes it easier for the detector to detect obfuscation rather than malicious behavior. 2) Common countermeasures, such as improving the quality of the training set by adding obfuscated samples and using state-of-the-art deep learning models, do not work well. 3) The root cause of obfuscation's impact on these detectors is that the feature spaces they use only reflect shallow differences in code, not the essential differences between benign and malicious programs, and are therefore easily affected by the differences introduced by obfuscation. 4) Obfuscation has a similar impact on real-world detectors on VirusTotal, indicating that this is a common practical problem.

Paper link: https://doi.org/10.1145/3597926.3598146


5、Automated Generation of Security-Centric Descriptions for Smart Contract Bytecode

Smart contract and decentralized application (DApp) users face significant risks because they lack the knowledge needed to avoid contract code that is vulnerable or malicious. This paper proposes Tx2TXT, a novel system that automatically generates security-focused textual descriptions from smart contract bytecode. To capture the security aspects of financial applications, we formally define a funds transfer graph that models the critical fund flows in a smart contract. To ensure that the descriptions derived from these graphs are both expressive and concise, we employ a graph convolutional network (GCN) based model to identify security-related condition statements and selectively add them to our graphs. To convert low-level bytecode instructions into readable text scripts, we leverage robust API signatures to recover the semantics of the bytecode. We evaluated Tx2TXT on 890 well-labeled vulnerable, malicious, and secure contracts with developer-written descriptions. Our results show that Tx2TXT outperforms existing solutions and effectively helps end users avoid risky contracts.

Paper link: https://doi.org/10.1145/3597926.3598132


6、Automated Program Repair from Fuzzing Perspective

This paper proposes a new approach to connect two closely related topics: fuzz testing and automated program repair (APR). The article is divided into two parts. The first part describes the similarities between fuzz testing and APR, both of which can be viewed as a search problem. The second part introduces a new patch scheduling algorithm called Casino, which is designed from the perspective of fuzz testing to improve search efficiency. Our experiments show that Casino outperforms existing algorithms. We are also promoting open science by sharing SimAPR, a simulation tool that can be used to evaluate new patch scheduling algorithms.

Paper link: https://doi.org/10.1145/3597926.3598101


7、Automatic Testing and Benchmarking for Configurable Static Analysis Tools

Static analysis is an important tool for detecting bugs in real-world software. The emergence of numerous analysis algorithms with their own trade-offs has led to a proliferation of configurable static analysis tools, but their complex and undertested configuration spaces have been a barrier to widespread adoption. To improve the reliability of these tools, my research focuses on developing new methods to automatically test and debug them. First, I describe an empirical study that helps us understand the performance and behavior of configurable taint analysis tools for Android. The results of this research inspired the development of ECSTATIC, a testing and debugging framework that goes beyond taint analysis and can be used to test any configurable static analysis tool. Next steps in this research involve the automated creation of real-world benchmarks related to static analysis, the ground truth of the relevant benchmarks, and the analysis characteristics.

Paper link: https://doi.org/10.1145/3597926.3605232


8、Automatically Reproducing Android Bug Reports using Natural Language Processing and Reinforcement Learning

When resolving issues submitted by users via bug reports, Android developers try to reproduce and observe the crash described in the report. Due to the low quality of bug reports and the complexity of modern applications, reproduction is neither straightforward nor quick. Therefore, automated approaches that help reproduce Android bug reports are urgently needed. However, current approaches can only handle limited forms of natural language text and struggle to reproduce a crash when the original report has missing or imprecise steps. In this paper, we introduce a new, fully automated approach for reproducing crashes from Android bug reports that addresses these limitations. Our approach leverages natural language processing techniques to analyze the natural language in Android bug reports more comprehensively and accurately, and designs new reinforcement-learning-based techniques to guide the search for successful reproduction steps. We conducted an empirical evaluation on 77 real-world bug reports. Our approach achieves 67% precision and 77% recall in accurately extracting reproduction steps from bug reports, reproduces 74% of all bug reports and 64% of the reports with missing steps, significantly outperforming the state of the art.

Paper link: https://doi.org/10.1145/3597926.3598066


9、BehAVExplor: Behavior Diversity Guided Testing for Autonomous Driving Systems

Testing automated driving systems (ADS) is critical to ensuring the reliability and safety of autonomous vehicles. Existing methods mainly focus on finding safety violations but ignore the diversity of the generated test cases, which can produce many redundant test cases and failures. Such redundant failures degrade testing performance and increase failure analysis costs. This paper proposes a novel behavior-diversity-guided fuzzing technique, BehAVExplor, to explore the different behaviors of the ego vehicle (i.e., the vehicle controlled by the ADS under test) and detect diverse violations. Specifically, we design an efficient unsupervised model, BehaviorMiner, to characterize the ego vehicle's behavior. BehaviorMiner extracts temporal features from a given scenario and performs clustering-based abstraction to group behaviors with similar features into abstract states. A new test case is added to the seed corpus if it triggers new behavior (for example, covering new abstract states). Due to the potential conflict between behavioral diversity and general violation feedback, we further propose an energy mechanism to guide seed selection and mutation, where the energy of a seed quantifies how good it is. We evaluated BehAVExplor on the industrial-grade Apollo ADS and the LGSVL simulation environment. Empirical results show that BehAVExplor effectively discovers more diverse violations than existing techniques.

Paper link: https://doi.org/10.1145/3597926.3598072


10、Beware of the Unexpected: Bimodal Taint Analysis

Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking in particular can find vulnerabilities caused by complex data flows through multiple functions. However, determining exactly which flows are problematic is challenging and sometimes depends on factors beyond pure program analysis, such as conventions and informal knowledge. For example, it is surprising, and potentially problematic, to learn that the parameter locale of an API function ends up in a file path. In contrast, it is entirely unsurprising that the parameter command passed to the API function execaCommand is ultimately interpreted as part of an operating system command. This paper introduces a bimodal taint analysis method, Fluffy, that combines static analysis, which reasons about data flows, with machine learning, which probabilistically determines which flows are likely to be problematic. The key idea is to have a machine learning model predict, from the natural language information involved in a tainted flow (such as API names), whether the flow is expected or unexpected, and to report only the latter to developers. We present a general framework and instantiate it with four learning models that offer different trade-offs between the need for annotated training data and prediction accuracy. We implemented Fluffy on top of the CodeQL analysis framework and applied it to 250,000 JavaScript projects. When evaluated against five common vulnerability types, Fluffy achieves an F1 score of 0.85 or higher for four of them across various datasets.
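
The "bimodal" idea can be pictured with a toy classifier over the identifier text of a flow's source and sink. The training pairs, labels, and feature choice below are made up and far simpler than Fluffy's four learning models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (source identifier + sink description) flattened to text, with a label:
# 1 = expected flow, 0 = unexpected (the kind that would be reported).
flows = [
    ("command execaCommand", 1),
    ("cmd childProcessExec", 1),
    ("locale pathJoin", 0),
    ("username fileWrite", 0),
]
texts, labels = zip(*flows)

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(),
)
model.fit(list(texts), list(labels))

# Only flows predicted as unexpected (label 0) would be shown to developers.
print(model.predict(["locale readFileSync"]))
```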

Paper link: https://doi.org/10.1145/3597926.3598050


11、Beyond “Protected” and “Private”: An Empirical Security Analysis of Custom Function Modifiers in Smart Contracts

Smart contracts are application-layer code running on blockchain ledgers that provide programming logic by executing predefined functions in response to transactions. By default, smart contract functions can be called by any party. To protect them, Solidity, the popular smart contract language of the Ethereum blockchain, provides a unique language-level keyword, "modifier", which allows developers to define custom function access control policies beyond the "protected" and "private" modifiers of traditional programming languages.

This article aims to conduct a large-scale security analysis of modifiers used in real-world Ethereum smart contracts. To achieve this goal, we design and implement SoMo, a novel smart contract analysis tool. Its main goal is to identify insecure modifiers that can be bypassed from one or more unprotected smart contract functions. This is challenging due to the complex relationship between the modifier and its variables/functions and the uncertainty of the entry functions accessible to the attacker. To overcome these problems, we first propose a new structure, Modifier Dependency Graph (MDG), to connect all modifier-related control/data flows. On MDG, we model system variables, generate symbolic path constraints, and iteratively test each candidate entry function. Our extensive evaluation results show that SoMo outperforms the state-of-the-art SPCon tool in detecting all true positives and correctly avoiding 11 false positives. It also achieved a high accuracy of 91.2% when analyzing a large dataset of 62,464 contracts, with over 400 modifiers identified as bypassable. Our analysis further revealed three interesting security findings about modifiers and nine major ways in which modifiers are used. SoMo is integrated into the online security scanning service MetaScan.

Paper link: https://doi.org/10.1145/3597926.3598125


12、Building Critical Testing Scenarios for Autonomous Driving from Real Accidents

One goal of developing and popularizing autonomous driving technology is to reduce traffic accidents caused by human factors. However, recent data on fatal crashes involving automated driving systems (ADS) shows that this important goal has not yet been achieved, which raises new requirements for more comprehensive and targeted testing of safe driving. This paper proposes a method to automatically construct critical test scenarios from real accident data. First, we propose a new model called M-CPS (Multi-channel Panoptic Segmentation) to extract effective information from accident records (such as images or videos) and separate the individual traffic participants so that the scene can be further restored. Compared with traditional panoptic segmentation models, the M-CPS model can effectively handle the segmentation challenges in accident records caused by shooting angle, image quality, pixel overlap, and other issues. The extracted core information is then connected to a virtual test platform to generate the original scenario set. In addition, we designed a mutation testing scheme based on the original scenario set, which greatly enriches the scenario library used for testing. In experiments, the M-CPS model achieved a PQ of 66.1% on the CityScapes test set, showing that its performance fluctuates only slightly compared with the best baseline model on the pure panoptic segmentation task. On the SHIFT dataset, the semantic segmentation branch achieved an IoU of 84.5% and the instance segmentation branch achieved an mAP of 40.3%. We then generated original and mutated scenario sets from the UCF-Crime, CADP, and US-Accidents datasets. These generated scenario sets were connected to the Apollo and Carla simulation platforms and used to test ADS prototypes. We found three types of scenarios that led to accidents with the ADS prototypes, indicating flaws in existing ADS prototypes. Our solution provides a possible new direction for recovering critical scenarios in ADS testing and can improve efficiency in related fields.

Paper link: https://doi.org/10.1145/3597926.3598070


13、CGuard: Scalable and Precise Object Bounds Protection for C

Spatial safety violations are the root cause of many security attacks and unexpected application behavior. Existing techniques for enforcing spatial safety generally work at either object or pointer granularity. Object-based approaches tend to incur high CPU overhead, while pointer-based approaches incur both high CPU and high memory overhead.

SGXBounds is an object-based approach that provides lower overhead to accurately protect objects from out-of-bounds access than other tools with similar precision. However, a major drawback of this approach is that it cannot support address spaces larger than 32 bits.

In this paper, we introduce CGuard, a tool that provides precise object-bounds protection for C applications with overhead comparable to SGXBounds, without restricting the application's address space. CGuard stores bounds information just before the object's base address and encodes the relative offset of the base address in the spare bits of the virtual address available on the x86_64 architecture. For objects whose offset cannot fit in the spare bits, CGuard uses a custom memory layout that lets it find the object's base address with only one memory access. Our study revealed spatial safety violations in the gcc and x264 benchmarks of the SPEC CPU2017 suite and in the string_match benchmark of the Phoenix suite. The execution time overheads for the SPEC CPU2017 and Phoenix benchmark suites were 42% and 26% respectively, while Apache web server throughput dropped by 30% with the CPU fully saturated. These results show that CGuard can be highly effective while maintaining a reasonable level of efficiency.
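
The pointer-encoding trick can be illustrated with plain integer arithmetic. The bit layout below is a deliberate simplification for illustration, not CGuard's exact scheme.

```python
# User-space x86_64 pointers leave the upper 16 bits unused, so a small
# offset back to the object's base can be packed there.
TAG_SHIFT = 48
ADDR_MASK = (1 << TAG_SHIFT) - 1

def tag_pointer(ptr, base):
    offset = ptr - base
    assert 0 <= offset < (1 << 16), "large offsets need the fallback layout"
    return (offset << TAG_SHIFT) | (ptr & ADDR_MASK)

def base_of(tagged):
    return (tagged & ADDR_MASK) - (tagged >> TAG_SHIFT)

def check_access(tagged, object_size, n_bytes):
    """Bounds metadata (the object size) is assumed to sit just before base."""
    base = base_of(tagged)
    addr = tagged & ADDR_MASK
    if not (base <= addr and addr + n_bytes <= base + object_size):
        raise MemoryError("spatial safety violation")

base = 0x7f00_0000_1000
p = tag_pointer(base + 40, base)
check_access(p, object_size=64, n_bytes=8)      # ok: bytes 40..47 of 64
```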

Paper link: https://doi.org/10.1145/3597926.3598137


14、CONCORD: Clone-Aware Contrastive Learning for Source Code

Over the past few years, deep learning (DL) models have shown great potential for analyzing source code. Recently, self-supervised pre-training has received attention for learning universal code representations, which is valuable for many downstream software engineering tasks such as cloning and bug detection.

While previous studies have successfully learned from different code abstractions (e.g., tokens, ASTs, graphs), we argue that learning a general-purpose representation must also account for how developers write code day to day. On the one hand, human developers tend to write repetitive programs, reusing existing code snippets from the current codebase or from online resources (such as Stack Overflow) rather than implementing functionality from scratch; this behavior leads to a large number of code clones. On the other hand, a faulty clone can trigger buggy program behavior.

Therefore, as a proxy for incorporating developer coding behavior into pre-training schemes, we propose to include code clones and their anomalous variants. Specifically, we propose CONCORD, a self-supervised pre-training strategy that places benign clones closer in the representation space while pushing aberrant variants farther away. We demonstrate that CONCORD’s clone-aware pre-training significantly reduces expensive pre-training resource requirements while improving the performance of downstream software engineering tasks. We also empirically demonstrate that CONCORD is able to improve existing pre-trained models, learn better representations, and thus become more efficient at identifying semantically equivalent programs and distinguishing buggy from bug-free code.

Paper link: https://doi.org/10.1145/3597926.3598035


15、Catamaran: Low-Overhead Memory Safety Enforcement via Parallel Acceleration

Memory safety issues are a chronic problem for C/C++ programs. Dynamic memory safety enforcement, the dominant approach, is effective but suffers from excessive runtime overhead. Existing attempts to reduce this overhead are either labor-intensive, rely heavily on specific hardware or compiler support, or are simply ineffective. This paper proposes a novel technique that reduces the time overhead by executing the dynamic checking code in parallel. We leverage static dependency analysis and dynamic profitability analysis to identify and schedule code that can run concurrently in separate threads. We implemented a tool called Catamaran and evaluated it on a rich set of benchmarks. The experimental results confirm that Catamaran significantly reduces the runtime overhead of existing dynamic tools without sacrificing any memory safety enforcement capability.

Paper link: https://doi.org/10.1145/3597926.3598098


16、CodeGrid: A Grid Representation of Code

Code representation is a critical step in applying artificial intelligence to software engineering. General-purpose natural language processing representations are effective but do not take full advantage of the rich structure inherent in code. Recent research has focused on extracting abstract syntax trees (ASTs) and integrating their structural information into code representations. These AST-enhanced representations have advanced the state of the art and accelerated new applications of AI in software engineering. However, ASTs ignore important control and data flow aspects of code structure, leaving some potentially relevant code signals unexploited. For example, purely image-based representations perform almost as well as AST-based representations, even though they must learn to recognize tokens, let alone their semantics. This result from previous work is strong evidence that there is still room for improvement in these code representations; it also raises the question of exactly which signals image-based methods exploit. We answer this question. We show that code is spatial in nature and exploit this fact to propose a new representation that embeds tokens in a grid that preserves the code's layout. Unlike some state-of-the-art methods, this approach is agnostic to the downstream task: it provides the learning algorithm with spatial signals regardless of whether the task is generation or classification. For example, we demonstrate that convolutional neural networks (CNNs), which are inherently spatially aware, can use this output to effectively solve fundamental software engineering tasks such as code classification, code clone detection, and vulnerability detection, and that PixelCNN can use the grid representation for code completion. We validate our spatial code hypothesis through extensive experiments and quantify model performance under changes to the grid that preserve the code layout. To demonstrate its generality, we show how the representation improves model performance across a range of tasks; on clone detection, the F1 score increases by 3.3% compared to ASTNN.
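
As a sketch of the "code is spatial" idea, the snippet below maps a source fragment onto a 2-D grid of token ids that preserves its layout, which a CNN could then consume. The tokenization and grid size are naive placeholders, not the paper's pipeline.

```python
import numpy as np

def code_to_grid(source, vocab, rows=16, cols=40):
    """Place token ids at (line, column) positions so the code's 2-D layout
    survives; unknown tokens map to id 1, empty cells stay 0."""
    grid = np.zeros((rows, cols), dtype=np.int64)
    for r, line in enumerate(source.splitlines()[:rows]):
        col = 0
        for tok in line.split():
            c = line.index(tok, col)
            if c < cols:
                grid[r, c] = vocab.get(tok, 1)
            col = c + len(tok)
        # a CNN (or PixelCNN for completion) can now exploit spatial locality
    return grid

vocab = {"def": 2, "return": 3, "if": 4}
print(code_to_grid("def f(x):\n    if x:\n        return x\n", vocab)[:3, :12])
```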

Paper link: https://doi.org/10.1145/3597926.3598141


17、CoopHance: Cooperative Enhancement for Robustness of Deep Learning Systems

For deep learning systems, adversarial attacks have always been a threat that requires attention. Adversarial attacks can cause deep learning systems to misbehave by adding human-imperceptible perturbations to harmless inputs. Considering the popularity of deep learning systems in industry, developers need to urgently take steps to make deep learning systems more robust to adversarial attacks.

In this study, we propose a new enhancement technique called CoopHance. CoopHance leverages two specially customized components, a Regulator and an Inspector, to cooperatively enhance the robustness of deep learning systems against adversarial examples with different distortions. The Regulator purifies adversarial examples with low or moderate distortion, while the Inspector detects adversarial examples with high distortion by capturing abnormal states of the deep learning system. Evaluation with various attacks shows that, on average, CoopHance successfully defends against 90.62% and 96.56% of the adversarial examples generated for unprotected systems on the CIFAR-10 and SVHN datasets respectively, which is more effective than five state-of-the-art enhancement techniques, including Feature Squeeze, LID, SOAP, Adversarial Training, and MagNet, with an average improvement of 188.14%. Meanwhile, when attackers generate new adversarial examples against the enhanced systems, CoopHance rejects 78.06% of the attacks, an average improvement of 82.71% over the best of the five enhancement techniques.

Paper link: https://doi.org/10.1145/3597926.3598093


18、DeFiTainter: Detecting Price Manipulation Vulnerabilities in DeFi Protocols

DeFi protocols are programs that manage high-value digital assets on blockchains. Price manipulation vulnerabilities are among the common vulnerabilities in DeFi protocols, allowing attackers to gain excessive profits by manipulating token prices. This paper proposes DeFiTainter, a cross-contract taint analysis framework for detecting price manipulation vulnerabilities. DeFiTainter features two innovative mechanisms that ensure its effectiveness. The first is constructing a call graph for cross-contract taint analysis by recovering call information not only from code constants but also from contract storage and function parameters. The second is high-level semantic induction for detecting price manipulation vulnerabilities, which precisely identifies taint sources and sinks and tracks tainted data across contracts. Extensive evaluation on real-world incidents and high-value DeFi protocols shows that DeFiTainter outperforms existing methods and achieves state-of-the-art performance in detecting price manipulation vulnerabilities, with 96% precision and 91.3% recall. In addition, DeFiTainter uncovered three previously undisclosed price manipulation vulnerabilities.

Paper link: https://doi.org/10.1145/3597926.3598124


19、DeUEDroid: Detecting Underground Economy Apps Based on UTG Similarity

In recent years, the underground economy has flourished in mobile systems. These underground economy applications (UEware for short) make profits by providing illegal services, especially in sensitive areas (such as gambling, pornography, loans). Unlike traditional malware, most of them (more than 80%) do not have malicious payloads. Due to their unique characteristics, existing detection methods are unable to effectively and efficiently deal with this emerging threat.

To address this problem, we propose a novel approach that detects UEware effectively and efficiently by considering their UI transition graphs (UTGs). Based on this approach, we designed and implemented a detection system named DeUEDroid. To evaluate DeUEDroid, we collected 25,717 applications and built the first large-scale real-world UEware dataset (1,700 applications). Evaluation on this dataset shows that DeUEDroid covers new UI features and statically constructs accurate UTGs; it achieves a detection F1 score of 98.22% and a classification accuracy of 98.97%, significantly outperforming traditional approaches. Evaluation on 24,017 applications demonstrates its effectiveness and efficiency in detecting UEware in real-world scenarios. The results also show that UEware is prevalent: 54% of the apps in the wild and 11% of the apps in app stores are UEware. Our work provides implications for future work on analyzing and detecting UEware. To collaborate with the community, we have made our prototype system and dataset available online.

Paper link: https://doi.org/10.1145/3597926.3598051


20、Detecting Condition-Related Bugs with Control Flow Graph Neural Network

Automated error detection is critical to high-quality software development and has received widespread attention over the years. Among various errors, previous research has shown that conditional expressions are very error-prone and condition-related errors are common in practice. Traditional automated error detection methods are often limited to compilable code and require tedious manual work. Recent deep learning-based work tends to learn general syntax features based on abstract syntax trees (AST), or apply existing graph neural networks on program graphs. However, AST-based neural models may miss important control flow information of the source code, while existing graph neural networks for error detection tend to learn local neighborhood structure information. Generally speaking, condition-related errors are highly affected by control flow knowledge, so we propose a control flow graph-based graph neural network (CFGNN) to automatically detect condition-related errors, which includes a graph-structured LSTM unit to efficiently learn control flow knowledge and long-range contextual information.

We also employ an API usage attention mechanism to leverage API knowledge. To evaluate the proposed approach, we collect real-world bugs in popular GitHub repositories and build a large-scale condition-related bug dataset. Experimental results show that our proposed method significantly outperforms existing state-of-the-art methods in detecting condition-related errors.
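
To make control-flow-based message passing concrete, here is a minimal numpy sketch in which each statement node aggregates the states of its CFG successors. It is a generic graph-propagation layer for illustration only, not CFGNN's graph-structured LSTM or its API usage attention.

```python
import numpy as np

def cfg_message_pass(node_feats, adj, w_self, w_neigh, rounds=2):
    """node_feats: (n, d) statement embeddings; adj[i, j] = 1 if the CFG has
    an edge i -> j. Each round mixes a node's state with its successors'."""
    h = node_feats
    for _ in range(rounds):
        neighbor_sum = adj @ h                    # aggregate successor states
        h = np.tanh(h @ w_self + neighbor_sum @ w_neigh)
    return h.mean(axis=0)                         # graph-level representation

rng = np.random.default_rng(0)
n, d = 5, 8                                       # 5 statements, 8-dim features
adj = np.zeros((n, n))
adj[0, 1] = adj[1, 2] = adj[1, 3] = adj[2, 4] = adj[3, 4] = 1
graph_vec = cfg_message_pass(rng.normal(size=(n, d)), adj,
                             rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(graph_vec.shape)                            # (8,) -> feed to a classifier
```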

Paper link: https://doi.org/10.1145/3597926.3598142


21、Detecting State Inconsistency Bugs in DApps via On-Chain Transaction Replay and Fuzzing

Decentralized applications (DApps) consist of multiple smart contracts running on a blockchain. As the DApp ecosystem grows in popularity, vulnerabilities in DApps can cause significant impact, such as financial loss. Identifying vulnerabilities in DApps is not easy, because modern DApps consist of complex interactions across multiple contracts. Lacking precise contextual information when analyzing smart contracts, previous studies suffer from either high false positive rates or high false negative rates. This paper introduces IcyChecker, a new fuzzing-based framework that effectively identifies state inconsistency (SI) bugs, a specific type of bug that can lead to exploits such as reentrancy and front-running attacks. Unlike previous work, IcyChecker uses a precise set of contextual information to fuzz contracts by replaying historical on-chain transactions. Furthermore, instead of designing specific test oracles as other fuzzing approaches require, IcyChecker implements a novel mechanism that mutates sets of fuzzed transaction sequences and identifies SI bugs by observing their state differences. Evaluation of IcyChecker on the top 100 popular DApps shows that it identifies a total of 277 SI bugs with 87% precision. Comparing with other state-of-the-art tools such as Smartian, Confuzzius, and Sailfish, we show that IcyChecker not only identifies more SI bugs but also has a lower false positive rate, thanks to its integration of precise on-chain data and its unique fuzzing strategy. Our research provides new methods for discovering smart contract vulnerabilities in DApps.

Paper link: https://doi.org/10.1145/3597926.3598057


22、Detecting Vulnerabilities in Linux-Based Embedded Firmware with SSE-Based On-Demand Alias Analysis

Although the importance of using static taint analysis to detect taint-style vulnerabilities in Linux-based embedded firmware is widely recognized, existing approaches suffer from the following major limitations: (a) they do not properly handle indirect calls on the paths from attacker-controllable sources to security-sensitive sinks, leading to a large number of false negatives; (b) they use heuristics to identify taint sources, which is not accurate enough and leads to a high false positive rate.

To address these issues, we propose EmTaint, a novel static approach for accurate and fast detection of taint-style vulnerabilities in Linux-based embedded firmware. In EmTaint, we first design an on-demand alias analysis technique based on structured symbolic expressions (SSE). On top of it, we propose indirect call resolution and accurate taint analysis schemes. Combined with sanitization rule checking, EmTaint can accurately discover a large number of taint-style vulnerabilities within a limited time. We evaluated EmTaint on 35 real embedded firmware samples from six well-known vendors. The results show that EmTaint discovered at least 192 vulnerabilities, including 41 n-day vulnerabilities and 151 0-day vulnerabilities. At the time of writing, at least 115 CVE/PSV identifiers have been assigned to some of the reported vulnerabilities. Compared with state-of-the-art tools such as KARONTE and SaTC, EmTaint found more vulnerabilities in less time on the same dataset.

Paper link: https://doi.org/10.1145/3597926.3598062


23、EDHOC-Fuzzer: An EDHOC Protocol State Fuzzer

EDHOC is a compact and lightweight authenticated key exchange protocol proposed by the IETF, designed with small message sizes so that it suits constrained IoT communication technologies. In this tool paper, we present EDHOC-Fuzzer, a protocol state fuzzer for EDHOC client and server implementations. It employs model learning to generate a state machine model of an EDHOC implementation that captures its input/output behavior. This model can be used for model-based testing and fingerprinting, or analyzed for inconsistencies, state machine bugs, and security vulnerabilities. We give an overview of the architecture and usage of EDHOC-Fuzzer, show examples of the models it generates, and summarize our current findings.

Paper link: https://doi.org/10.1145/3597926.3604922


24、Enhancing REST API Testing with NLP Techniques

RESTful services are typically documented using the OpenAPI specification. Many automated testing techniques have been proposed that exploit the machine-readable part of such specifications to guide test generation, but their human-readable part has been mostly ignored. This is a missed opportunity, because the natural language descriptions in specifications often contain relevant information, including example values and inter-parameter dependencies, that can be used to improve test generation. Based on this idea, we propose NLPtoREST, an automated approach that applies natural language processing techniques to assist REST API testing. Given an API and its specification, NLPtoREST extracts additional OpenAPI rules from the human-readable part of the specification and enhances the specification by adding these rules to it. Testing tools can then transparently use the enhanced specification to generate better test cases. Because rule extraction can be inaccurate, due to inherent ambiguity in natural language or mismatches between documentation and implementation, NLPtoREST also includes a validation step designed to eliminate spurious rules. We conducted studies to assess the effectiveness of our rule extraction and validation approach, as well as the impact of the enhanced specifications on the performance of eight state-of-the-art REST API testing tools. Our results are encouraging, showing that NLPtoREST can extract many relevant rules with high accuracy, which significantly improves the testing tools' performance.

Paper link: https://doi.org/10.1145/3597926.3598131


25、Eunomia: Enabling User-Specified Fine-Grained Search in Symbolically Executing WebAssembly Binaries

Although existing techniques provide automated ways to mitigate the path explosion problem in symbolic execution, users still need to optimize symbolic execution by carefully applying various search strategies. Since existing approaches mainly support coarse-grained global search strategies, they cannot efficiently traverse complex code structures. This paper proposes Eunomia, a symbolic execution technique that supports fine-grained, local domain knowledge. Eunomia uses Aes, a domain-specific language that lets users specify local search strategies for different parts of the program, and it isolates the variable contexts of different local search strategies to avoid conflicts. We implemented Eunomia for WebAssembly, which allows it to analyze applications written in various languages; Eunomia is the first symbolic execution engine to support the full feature set of WebAssembly. We evaluate Eunomia on a microbenchmark suite and six real-world applications. The results show that Eunomia speeds up bug detection by up to three orders of magnitude. We also conducted a user study that demonstrates the benefits of using Aes. In addition, Eunomia verified six known bugs and detected two new zero-day bugs in Collections-C.

Paper link: https://doi.org/10.1145/3597926.3598064


26、Finding Short Slow Inputs Faster with Grammar-Based Search

Recent research has shown that appropriately instrumented mutation searches can generate short inputs that demonstrate performance issues. Another direction in fuzzing research shows that replacement with subtrees in a forest of derivation trees is an effective syntax-based fuzzing technique for discovering deep semantic errors. We combine performance fuzzing with syntax-based search by generating length-constrained derivation trees, where each subtree is labeled with its length. Additionally, we use performance instrumented feedback to guide the search. In contrast to fuzz testing for security issues, we focus on search processes that are short enough (up to an hour, using modest computing resources) to become part of the regular incremental testing process. We evaluate combinations of these methods on benchmarks including the best prior performance fuzzing tools. No single search technique dominates across all examples, but Monte Carlo tree search and length-restricted tree hybridization perform consistently well in example applications where semantic performance errors can be found with syntactically correct input. During our evaluation, we discovered a hanging bug in LunaSVG, which the developers have since acknowledged and fixed.
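
A minimal sketch of length-annotated subtree splicing follows, under the assumptions that derivation trees are plain dicts with "symbol", "children", and "length" fields (an encoding invented here) and that ancestor lengths are recomputed elsewhere after a swap.

```python
import random

def walk(node):
    yield node
    for child in node["children"]:
        if isinstance(child, dict):
            yield from walk(child)

def splice_length_bounded(tree, forest, max_len, rng=None):
    """Swap a random nonterminal subtree for a same-symbol donor from the
    forest, but only if the donor keeps the rendered input under max_len."""
    rng = rng or random.Random(0)
    victims = [n for n in walk(tree) if n["children"]]
    victim = rng.choice(victims)
    budget = max_len - tree["length"] + victim["length"]
    donors = [d for d in forest.get(victim["symbol"], []) if d["length"] <= budget]
    if donors:
        victim.update(rng.choice(donors))   # ancestor lengths need refreshing
    return tree
```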

Paper link: https://doi.org/10.1145/3597926.3598118


27、Fine-Grained Code Clone Detection with Block-Based Splitting of Abstract Syntax Tree

Code clone detection aims to find similar code fragments and is increasingly important in software engineering. Several techniques exist for detecting code clones. Text-based or token-based tools are scalable and efficient but lack syntactic awareness, and therefore perform poorly at detecting syntactic code clones. Tree-based methods can detect syntactic or semantic clones with good accuracy, but most are time-consuming and lack scalability. In addition, these methods cannot perform fine-grained clone detection and cannot pinpoint the specific code blocks that are cloned. In this paper, we present Tamer, a scalable and fine-grained tree-based syntactic code clone detector. Specifically, we propose a new technique that decomposes a complex abstract syntax tree into simple subtrees, which speeds up detection and enables fine-grained analysis of clone pairs to locate the specific cloned parts of the code. To examine Tamer's detection performance and scalability, we evaluate it on the widely used BigCloneBench dataset. Experimental results show that Tamer outperforms ten state-of-the-art code clone detection tools (i.e., CCAligner, SourcererCC, Siamese, NIL, NiCad, LVMapper, Deckard, Yang2018, CCFinder, and CloneWorks).
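
As an illustration of block-level splitting (shown here with Python's ast module on Python code, whereas Tamer itself is evaluated on the Java-based BigCloneBench), one can hash each block-level subtree so that clone pairs can be compared block by block.

```python
import ast
import hashlib

def block_fingerprints(source):
    """Hash every block-level subtree (functions, loops, conditionals) so two
    fragments can be compared block by block instead of as whole trees."""
    fingerprints = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.For, ast.While, ast.If)):
            dump = ast.dump(node, annotate_fields=False)
            key = f"{type(node).__name__}@line{node.lineno}"
            fingerprints[key] = hashlib.md5(dump.encode()).hexdigest()
    return fingerprints

a = block_fingerprints("def f(xs):\n    for x in xs:\n        print(x)\n")
b = block_fingerprints("def g(ys):\n    for x in ys:\n        print(x)\n")
# Shared hashes point at specific cloned blocks (here none match, because the
# variable names differ; a real tool would normalize identifiers first).
print(set(a.values()) & set(b.values()))
```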

Paper link: https://doi.org/10.1145/3597926.3598040


28、Fuzzing Deep Learning Compilers with HirGen

Deep learning (DL) compilers are widely used to optimize high-level DL models for efficient deployment on diverse hardware, and their quality has a profound impact on the quality of the compiled models. A recent bug study showed that optimization of the high-level intermediate representation (IR) is the most error-prone compilation stage, accounting for 44.92% of all collected bugs. However, existing testing techniques do not take features related to high-level optimization (such as the high-level IR) into account and therefore fall short in exposing bugs at this stage. To bridge this gap, we propose HirGen, an automated testing technique that effectively exposes bugs in high-level IR optimization. HirGen's design includes: 1) three coverage criteria for generating diverse and valid computational graphs; 2) full use of the language features of the high-level IR to generate diverse IRs; 3) three test oracles, two of which are inspired by unit testing and differential testing. HirGen has detected 21 bugs in TVM, of which 17 have been confirmed and 12 fixed. Furthermore, we construct four baselines using state-of-the-art DL compiler fuzzers that can cover the high-level optimization stage. Our experiments show that within 48 hours HirGen detects 10 crashes and inconsistencies that the baselines cannot detect. We also evaluate the usefulness of our proposed coverage criteria and test oracles.

Paper link: https://doi.org/10.1145/3597926.3598053


29、Fuzzing Embedded Systems using Debug Interfaces

Fuzz testing embedded systems is difficult. Their key components, microcontrollers, are highly diverse and difficult to virtualize, and their software often cannot be modified or instrumented. However, we observe that many, if not most, microcontrollers expose debugging capabilities through a debug interface, over which a debug probe (typically controlled through the GNU debugger, GDB) can set a limited number of hardware breakpoints. Using these hardware breakpoints, we can extract partial coverage feedback even for uninstrumented binaries, enabling effective fuzzing of embedded systems through a generic and widely available mechanism. Evaluated on four different microcontroller boards, our prototype implementation GDBFuzz quickly reaches high code coverage and detects known and new vulnerabilities. Since it can be applied to any program and system that GDB can debug, GDBFuzz is one of the least demanding and most versatile coverage-guided fuzzers.
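
A rough sketch of the core loop, assuming a GDB that can reach the target through a debug probe; the command list and hit detection below are placeholders, not GDBFuzz's actual interface.

```python
import subprocess

def coverage_via_hw_breakpoints(setup_cmds, addresses):
    """Set hardware breakpoints on a small batch of basic-block addresses,
    run the target, and treat every breakpoint hit as coverage feedback."""
    cmds = list(setup_cmds)                            # e.g. "target remote :3333"
    cmds += [f"hbreak *{hex(a)}" for a in addresses]   # scarce hardware breakpoints
    cmds += ["continue"] * (len(addresses) + 1)        # resume after each hit
    gdb_args = ["gdb", "--batch"]
    for c in cmds:
        gdb_args += ["-ex", c]
    out = subprocess.run(gdb_args, capture_output=True, text=True).stdout
    return {a for a in addresses if hex(a) in out}     # crude hit detection

# Rotating which addresses receive the scarce hardware breakpoints across
# inputs lets the fuzzer approximate coverage of uninstrumented firmware.
```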

Paper link: https://doi.org/10.1145/3597926.3598115


30、GDsmith: Detecting Bugs in Cypher Graph Database Engines

In the era of big data, graph database engines stand out for their efficiency in modeling and processing linked data. To ensure their high quality, automated test generation for graph database engines is crucial; random test generation is the most commonly adopted approach in practice. However, random test generation struggles to produce complex inputs (i.e., property graphs and queries) that yield non-empty query results, and generating such inputs is particularly important for detecting wrong-result bugs. To address this challenge, this paper presents GDsmith, the first approach for testing Cypher graph database engines. GDsmith ensures that every randomly generated query satisfies the semantic requirements. To increase the probability of generating complex queries that return non-empty results, GDsmith adopts two new techniques: graph-based generation of complex pattern combinations and data-based generation of complex conditions. Our evaluation shows that GDsmith is effective and efficient at generating complex queries that detect wrong results, significantly outperforming the baselines. GDsmith has detected 28 bugs in released versions of three highly popular open-source graph database engines and received positive feedback from their developers.

Paper link: https://doi.org/10.1145/3597926.3598046


31、GrayC: Greybox Fuzzing of Compilers and Analysers for C

Compiler and code analyzer fuzz testing has found and fixed many bugs in widely used frameworks such as LLVM, GCC and Frama-C. Most of these fuzzing techniques adopt a black-box approach, resulting in compilers and code analyzers being relatively immune to such fuzzers.

We propose a coverage-oriented and mutation-based approach for fuzzing C compilers and code analyzers, inspired by the success of such gray-box fuzz testing in other application domains. The main challenge in applying mutation-based fuzz testing in this context is that simple mutations are likely to produce programs that cannot be compiled. These programs are not useful for finding deep-seated errors that affect optimization, analysis, and code generation routines.

We designed a novel greybox fuzzing tool for C compilers and code analyzers by developing a new set of mutations that target common C language constructs and by transforming fuzzed programs so that they produce meaningful output. This enables differential testing to be used as the test oracle and paves the way for integrating fuzzer-generated programs into compiler and code analyzer regression test suites.

We implemented our approach in a new open-source tool called GrayC and show experimentally that it provides more coverage of the middle- and back-end stages of compilers and code analyzers than other mutation-based approaches, including Clang-Fuzzer, PolyGlot, and a technique similar to LangFuzz.

We have identified 30 compiler and code analyzer bugs using GrayC: 25 were previously unknown (22 of which have been fixed in response to our reports), and the other 5 were confirmed bugs that had been reported independently before we found them. Three further bug reports are under investigation. Beyond these results, we contributed 24 simplified versions of coverage-enhancing test cases produced by GrayC to the Clang/LLVM test suite, targeting 78 previously uncovered functions in the LLVM codebase.

Paper link: https://doi.org/10.1145/3597926.3598130


32、Green Fuzzer Benchmarking

Over the past decade, fuzz testing has gained increasing attention due to its effectiveness in discovering vulnerabilities. However, during this period, the evaluation of fuzz testing tools has been challenging, mainly due to the lack of standardized benchmarks. To alleviate this problem, in 2020, Google released an open source benchmarking platform called FuzzBench, which is widely used for accurate fuzz testing tool evaluation.

However, a typical FuzzBench experiment takes years of CPU time to complete. If we consider that fuzzing tools are actively being developed and require empirical evaluation of any changes, benchmarking becomes both computationally resource-constrained and time-consuming. In this paper, we propose an eco-friendly benchmarking platform called GreenBench, which significantly speeds up the evaluation of fuzz testing tools compared to FuzzBench while maintaining very high accuracy.

Unlike FuzzBench, GreenBench significantly increases the number of benchmark tests while significantly shortening the duration of fuzz testing. As a result, GreenBench generates fuzz testing tool rankings that are almost as accurate as FuzzBench (correlation is very high), but GreenBench is 18 to 61 times faster. We discuss the implications of these findings for the fuzzing community.

Paper link: https://doi.org/10.1145/3597926.3598144


33、Green Fuzzing: A Saturation-Based Stopping Criterion using Vulnerability Prediction

Fuzz testing is a widely used automated testing technique that uses random inputs to crash programs, thereby indicating security vulnerabilities. A difficult but important question is when to stop a fuzzing campaign. Typically, a campaign is terminated when the number of crashes and/or covered code elements has not increased for some period of time. To avoid terminating a campaign prematurely, before vulnerabilities are actually reached, code coverage is usually preferred over crash counts for deciding when to stop. However, a campaign may only be increasing coverage of non-security-critical code or repeatedly triggering the same crashes. As a result, code coverage and crash counts tend to overestimate the effectiveness of fuzzing, unnecessarily increasing the duration and cost of the testing process.

This paper explores the trade-off between the fuzzing time saved and the number of bugs missed when a campaign is stopped based on the saturation of covered, potentially vulnerable functions rather than on triggered crashes or regular function coverage. Based on a large-scale empirical evaluation of 30 open-source C programs with a total of 240 security bugs and 1,280 fuzzing campaigns, we first show that binary classification models trained on software with known vulnerabilities (CVEs), using lightweight machine-learning features derived from static application security testing tools and validated software metrics, can reliably predict (potentially) vulnerable functions. Second, we show that our proposed stopping criterion terminates 24-hour fuzzing campaigns 6 to 12 hours earlier than the saturation of crashes or regular function coverage, while missing on average fewer than 0.5 bugs per campaign.
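
The stopping rule itself is simple to state. Below is a minimal sketch assuming a callback that reports which of the predicted-vulnerable functions have been covered so far; the names and the patience window are illustrative, not the paper's configuration.

```python
import time

def fuzz_until_saturated(run_one_iteration, covered_vulnerable_funcs,
                         patience_seconds=3 * 3600):
    """Stop the campaign once coverage of (predicted) vulnerable functions
    has not grown for `patience_seconds`."""
    covered = set()
    last_growth = time.time()
    while time.time() - last_growth < patience_seconds:
        run_one_iteration()                       # one fuzzing round
        newly = covered_vulnerable_funcs() - covered
        if newly:
            covered |= newly
            last_growth = time.time()             # saturation clock resets
    return covered
```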

Paper link: https://doi.org/10.1145/3597926.3598043


34、Guided Retraining to Enhance the Detection of Difficult Android Malware

The popularity of the Android operating system makes it an attractive target for malware developers. To evade detection, including machine learning-based techniques, attackers invest in creating malware that closely resembles legitimate applications, challenging the current state of the art and creating samples that are difficult to detect. In this paper, we propose a method based on supervised representation learning called Guided Retraining to improve the performance of malware detectors. To do this, we first split the experimental dataset into subsets of “easy” and “hard” samples, where the degree of difficulty is related to the predicted probability produced by the malware detector. For the subset of "easy" samples, the base malware detector is used to make the final predictions, since the error rate on this subset is by design lower. Our work targets the second subset containing “difficult” samples, where the probabilities make the classifier less confident in its predictions and the error rate is higher. We apply Guided Retraining method on these difficult samples to improve their classification. Guided Retraining leverages the correct predictions and errors of the underlying malware detector to guide the retraining process. Guided Retraining uses supervised contrastive learning to learn new embeddings for difficult samples and train an auxiliary classifier for the final prediction. We validate our method using more than 265,000 malware and benign applications, covering four state-of-the-art Android malware detection methods. Experimental results show that Guided Retraining can improve state-of-the-art detectors by eliminating prediction errors on difficult samples, reducing error rates by up to 45.19%. Furthermore, we also note that our approach is general and designed to enhance the performance of binary classifiers on other tasks besides Android malware detection.
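
The easy/hard split can be pictured as follows, assuming an sklearn-style base detector; the thresholds are illustrative, while the paper's criterion is likewise based on the detector's predicted probabilities.

```python
def split_easy_hard(samples, base_detector, low=0.2, high=0.8):
    """Samples the base detector is confident about are 'easy'; the rest are
    'hard' and go through guided retraining (thresholds are illustrative)."""
    easy, hard = [], []
    for features in samples:
        p_malware = base_detector.predict_proba([features])[0][1]
        if p_malware <= low or p_malware >= high:
            easy.append(features)       # keep the base detector's prediction
        else:
            hard.append(features)       # re-embed + auxiliary classifier
    return easy, hard
```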

Paper link: https://doi.org/10.1145/3597926.3598123


35、Guiding Greybox Fuzzing with Mutation Testing

Grey-box fuzzing and mutation testing are two popular but largely independent areas of software testing research, with limited overlap between them to date. Grey-box fuzzing is typically used to search for new vulnerabilities and primarily uses code coverage to select which inputs to save. Mutation testing is primarily used as a stronger alternative to code coverage for evaluating the quality of regression test suites; the idea is to assess a test suite's ability to identify faults artificially injected into the target program. But what if we want to use grey-box fuzzing to generate high-quality regression tests?

In this paper, we develop and evaluate Mu2, a Java-based framework that incorporates mutation analysis into the grey-box fuzzing loop, with the goal of producing test-input corpora with high mutation scores. Mu2 leverages a differential oracle over expected outputs to identify inputs that exercise interesting program behavior without causing crashes. The paper describes several dynamic optimizations implemented in Mu2 to overcome the high cost of running mutation analysis on every fuzzer-generated input. These optimizations introduce a trade-off between fuzzing throughput and mutant-killing ability, which we evaluate empirically on five real-world Java benchmarks. Overall, variants of Mu2 produce test-input corpora with higher mutation scores than the state-of-the-art Java fuzzer Zest.
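
The feedback loop can be illustrated with a toy Python sketch (Mu2 itself is a Java framework; the target, mutants, and "fuzzer" below are made up): an input is added to the corpus only if it kills a mutant that no previously saved input killed.

```python
# Toy sketch of mutation-guided input selection (the idea behind Mu2, not its
# Java implementation): an input is kept for the corpus if it kills a mutant
# that no previously saved input killed.
import random

def target(x):                 # original program under test
    return x * 2 if x >= 0 else -x

MUTANTS = [                    # artificial faults injected into the target
    lambda x: x * 3 if x >= 0 else -x,   # mutate constant 2 -> 3
    lambda x: x * 2 if x > 0 else -x,    # mutate >= to >
    lambda x: x * 2 if x >= 0 else x,    # drop the negation
]

def killed(x):
    """Indices of mutants whose output differs from the original (differential oracle)."""
    return {i for i, m in enumerate(MUTANTS) if m(x) != target(x)}

corpus, killed_so_far = [], set()
random.seed(0)
for _ in range(200):
    x = random.randint(-10, 10)          # trivial 'fuzzer': random integers
    new_kills = killed(x) - killed_so_far
    if new_kills:                        # feedback: input killed a previously surviving mutant
        corpus.append(x)
        killed_so_far |= new_kills

print(corpus, killed_so_far)             # high-mutation-score regression inputs
```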

Paper link: https://doi.org/10.1145/3597926.3598107

dcc47c183915f0015a899fc232d8586d.png

36、How Effective Are Neural Networks for Fixing Security Vulnerabilities

Security vulnerability repair is a difficult task in desperate need of automation. Two groups of techniques show promise: (1) large language models of code (LLMs), pre-trained on source code for tasks such as code completion, and (2) automated program repair (APR) techniques that use deep learning (DL) models to automatically fix software defects. This paper is the first to study and compare the ability of LLMs and DL-based APR models to repair Java vulnerabilities. Contributions include: (1) applying and evaluating five LLMs (Codex, CodeGen, CodeT5, PLBART and InCoder), four fine-tuned LLMs, and four DL-based APR techniques on two real-world Java vulnerability benchmarks (Vul4J and VJBench); (2) designing code transformations to address the threat of overlap between Codex's training data and the test data; (3) creating a new Java vulnerability-repair benchmark, VJBench, and its transformed version VJBench-trans, to better evaluate LLMs and APR techniques; (4) evaluating the ability of LLMs and APR techniques to repair the transformed vulnerabilities in VJBench-trans. Our findings include: (1) existing LLMs and APR models fix very few Java vulnerabilities. Codex fixes the most, 10.2 (20.4%) vulnerabilities, and many of the generated patches fail to compile. (2) Fine-tuning with general APR data improves the vulnerability-repair ability of LLMs. (3) Our new VJBench reveals that LLMs and APR models fail to fix many Common Weakness Enumeration (CWE) types, such as CWE-325 (missing cryptographic step) and CWE-444 (HTTP request smuggling). (4) Codex still fixes 8.7 of the transformed vulnerabilities, outperforming all other LLMs and APR models. These findings call for innovations to enhance automated Java vulnerability repair, such as creating larger vulnerability-repair training datasets, fine-tuning LLMs with such data, and applying code simplification transformations to facilitate repair.

Paper link: https://doi.org/10.1145/3597926.3598135

86e96c060371fc14fb3278557ea597a6.png

37、Hybrid Inlining: A Framework for Compositional and Context-Sensitive Static Analysis

In interprocedural static analysis, context sensitivity is crucial for good precision. To achieve context sensitivity, top-down analysis fully inlines all statements of the callee at each call site, leading to statement explosion. Compositional analysis scales by inlining summaries of callees instead, but typically loses precision because it is not strictly context sensitive. We propose a compositional yet strictly context-sensitive static analysis framework. The framework is based on a key observation: compositional analysis usually loses precision only on a small number of critical statements that require context-sensitive analysis. Our approach therefore inlines the critical statements of each callee together with a summary of its non-critical statements, thus avoiding re-analysis of the non-critical parts. Furthermore, our analysis lazily summarizes critical statements on demand and stops propagating them once enough calling context has been accumulated to resolve them. We designed and implemented several analyses (including pointer analysis) on top of this framework. Our evaluation of the pointer analysis shows that it can analyze large Java programs from the DaCapo benchmark suite and industrial applications within minutes. Compared with context-insensitive analysis, hybrid inlining adds only 65% and 1% extra time overhead on DaCapo and the industrial applications, respectively.

Paper link: https://doi.org/10.1145/3597926.3598042

e71ebd80d2f551825dbff5887b5284b1.png

38、Icicle: A Re-designed Emulator for Grey-Box Firmware Fuzzing

Emulation-based fuzzers can test binaries without source code and conveniently test embedded applications for which automated execution on the target hardware is difficult and slow. Instrumentation techniques that extract feedback and guide input mutations toward effective test cases are at the core of modern fuzzers. However, modern emulation-based fuzzers have evolved by repurposing general-purpose emulators; as a result, developing and integrating fuzzing techniques (such as instrumentation approaches) is difficult, and they are often added in an ad hoc, instruction-set-architecture (ISA)-specific way. This limits existing fuzzing techniques to a few ISAs (such as x86/x86-64 or ARM/AArch64), which is a significant problem for fuzzing firmware across diverse ISAs. This study presents our effort to rethink emulation for fuzzing. We design and implement Icicle, a multi-architecture emulation framework for fuzzing. We demonstrate that instrumentation can be added to Icicle once, in an architecture-independent manner, with low execution overhead. We used Icicle as the emulator for Fuzzware, a state-of-the-art ARM firmware fuzzer, and replicated its results. Importantly, we show that the new instrumentation available in Icicle enables the discovery of new vulnerabilities. We demonstrate Icicle's fidelity and the effectiveness of its architecture-agnostic instrumentation by discovering benchmark vulnerabilities, which require the instrumentation to have known and specific capabilities, across different instruction set architectures (x86-64, ARM/AArch64, RISC-V, MIPS). Additionally, to demonstrate Icicle's ability to discover vulnerabilities in architectures unsupported by existing emulation-based fuzzers, we fuzzed real firmware binaries for Texas Instruments' MSP430 ISA and discovered seven new vulnerabilities.

Paper link: https://doi.org/10.1145/3597926.3598039

52926b0cf103fca228386ed6843efe0d.png

39、Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis

Binary code similarity analysis is the process of identifying, for a function in binary executable form, a set of similar functions among a large number of candidates. These similar functions are usually compiled from the same source code under different compilation settings. The analysis has many applications, such as malware detection, code clone detection, and automated software patching. Current state-of-the-art methods employ complex deep learning models such as Transformers. We observe that these models suffer from an undesirable instruction distribution bias caused by specific compiler conventions. We develop a new technique to detect and repair this bias by removing the corresponding instructions from the dataset and fine-tuning the model. This requires synergy between deep-learning model analysis and program analysis. Our results show that we can significantly improve the performance of state-of-the-art models, by up to 14.4% in the most challenging cases, where the test data may not follow the training data distribution.
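
A rough sketch of the de-emphasis step, under our own simplifying assumptions: functions are token lists, and a simple cross-compiler frequency-divergence heuristic stands in for the paper's combination of model analysis and program analysis to pick the biased instructions that get dropped before fine-tuning.

```python
# Rough sketch of instruction de-emphasis (assumption-laden; the paper's bias
# detection combines deep-learning model analysis with program analysis, which
# this frequency-divergence heuristic only approximates).
from collections import Counter

def instruction_bias(funcs_gcc, funcs_clang):
    """Score instructions whose relative frequency differs strongly across compilers."""
    c1 = Counter(ins for f in funcs_gcc for ins in f)
    c2 = Counter(ins for f in funcs_clang for ins in f)
    n1, n2 = sum(c1.values()), sum(c2.values())
    scores = {}
    for ins in set(c1) | set(c2):
        p, q = c1[ins] / n1, c2[ins] / n2
        scores[ins] = abs(p - q) / max(p, q)     # 1.0 = appears under one compiler only
    return scores

def deemphasize(funcs, scores, cutoff=0.9):
    """Drop heavily biased instructions before fine-tuning the similarity model."""
    return [[ins for ins in f if scores.get(ins, 0.0) < cutoff] for f in funcs]

gcc_funcs = [["push rbp", "mov rbp, rsp", "xor eax, eax", "ret"],
             ["push rbp", "mov rbp, rsp", "add eax, ebx", "ret"]]
clang_funcs = [["xor eax, eax", "ret"], ["add eax, ebx", "ret"]]

scores = instruction_bias(gcc_funcs, clang_funcs)
print(deemphasize(gcc_funcs, scores))   # frame-setup idioms removed
```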

Paper link: https://doi.org/10.1145/3597926.3598121

028bbe91a42441ade21d928ee99d43a3.png

40、Interpreters for GNN-Based Vulnerability Detection: Are We There Yet?

Traditional vulnerability detection methods are limited by the large amount of manual work they require. Automated vulnerability detection methods have therefore attracted researchers' interest, especially deep learning, which has achieved remarkable results. Since graphs convey the structural characteristics of code better than text, vulnerability detection methods based on graph neural networks (GNNs) significantly outperform text-based methods and are becoming increasingly popular. However, to a security analyst a GNN model is close to a black box: it cannot provide clear evidence explaining why a given code sample was classified as vulnerable or safe. Many GNN interpreters have been proposed, but the explanations they provide for vulnerability detection models are highly inconsistent and unconvincing in terms of stability and credibility. To address these issues, we propose principled guidelines, based on the concerns of vulnerability detection (stability, robustness, and effectiveness), for evaluating the quality of interpretation methods for GNN-based vulnerability detectors. We conduct extensive experiments evaluating the interpretation performance of six well-known interpreters (GNN-LRP, DeepLIFT, GradCAM, GNNExplainer, PGExplainer, and SubGraphX) on four vulnerability detectors (DeepWukong, Devign, IVDetect, and Reveal). The experimental results show that the studied interpreters perform poorly in terms of effectiveness, stability, and robustness. For effectiveness, we find that instance-independent methods outperform the others thanks to their deeper insight into the detection model. For stability, perturbation-based interpretation methods are more resilient to slight changes in model parameters because they are model-agnostic. For robustness, instance-independent methods provide more consistent interpretations for similar vulnerabilities.

Paper link: https://doi.org/10.1145/3597926.3598145

9b7030c82b06834d23606ecddde95b8f.png

41、ItyFuzz: Snapshot-Based Fuzzer for Smart Contract

Smart contracts are critical financial instruments, and their security is of paramount importance. However, smart contracts are difficult to fuzz due to the persistent blockchain state behind all transactions. Mutating sequences of transactions is complex and often leads to suboptimal exploration of both the input space and the program space. This article introduces ItyFuzz, a new snapshot-based fuzzer for testing smart contracts. Instead of storing and mutating transaction sequences, ItyFuzz snapshots states and mutates singleton transactions. To explore interesting states, ItyFuzz introduces a dataflow waypoint mechanism that identifies states with greater potential momentum. ItyFuzz also incorporates comparison waypoints to prune the state space. By keeping snapshots of states, ItyFuzz can quickly synthesize concrete exploits such as reentrancy attacks. Because ItyFuzz has second-level response times when testing smart contracts, it can be used for on-chain testing, which has many advantages over testing in a local development environment. Finally, we evaluate ItyFuzz on real-world smart contracts and on several hacked on-chain DeFi projects. ItyFuzz outperforms existing fuzzers in instruction coverage and can quickly find and generate realistic exploits for on-chain projects.
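
The snapshot idea can be illustrated on a toy stateful object in Python (a stand-in for an EVM contract; ItyFuzz itself works on real bytecode and uses waypoint feedback rather than the simplistic "new maximum balance" heuristic below).

```python
# Toy illustration of snapshot-based fuzzing (the core idea of ItyFuzz,
# sketched on a fake Python "contract" rather than the EVM).
import copy, random

class Vault:
    """Stand-in for a stateful contract; the guarded path needs a specific state."""
    def __init__(self):
        self.balance = 0
    def deposit(self, amount):
        self.balance += amount
    def withdraw(self, amount):
        assert self.balance >= 100, "guarded path"   # reachable only from a rich state
        self.balance -= amount

random.seed(1)
corpus = [(Vault(), None)]                   # (state snapshot, last transaction)
for _ in range(500):
    state, _ = random.choice(corpus)         # pick a saved snapshot, not a tx sequence
    state = copy.deepcopy(state)             # cheap 'snapshot' restore
    tx = random.choice([("deposit", random.randint(1, 60)),
                        ("withdraw", random.randint(1, 60))])
    try:
        getattr(state, tx[0])(tx[1])
    except AssertionError:
        continue                             # guard not yet satisfiable from this state
    if state.balance > max(s.balance for s, _ in corpus):
        corpus.append((state, tx))           # keep states that make new progress

print(max(s.balance for s, _ in corpus))     # deep states reached without re-mutating sequences
```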

Paper link: https://doi.org/10.1145/3597926.3598059

08462416839a8e4ee8ce3d950ffbdcfa.png

42、Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models

Deep learning (DL) systems have become popular and ubiquitous in our daily lives. These systems are built on top of popular DL libraries such as TensorFlow and PyTorch, which provide the building blocks of DL systems. Detecting bugs in these DL libraries is critical for almost all downstream DL systems in order to ensure validity and security for end users. Meanwhile, traditional fuzzing techniques are hardly effective in this challenging domain, because an input DL program must satisfy both the syntax/semantics of the input language (such as Python) and the DL API input/shape constraints of tensor computations.

To address these limitations, we propose TitanFuzz - the first method that directly leverages large language models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are huge models trained on billions of code snippets that can automatically generate human-like code. Our key insight is that modern LLMs have seen a large number of code snippets calling DL library APIs in their training corpora, and can thus implicitly learn both the language syntax/semantics and the intricate DL API constraints needed to generate valid DL programs. Specifically, we use both generative and infilling LLMs (such as Codex and InCoder) to generate and mutate valid, diverse input DL programs for fuzzing. Our experimental results show that TitanFuzz improves code coverage on TensorFlow/PyTorch by 30.38%/50.84% over state-of-the-art fuzzers. In addition, TitanFuzz found 65 bugs, 44 of which were confirmed as previously unknown.

This paper shows that modern giant LLMs can be leveraged to directly perform both generation-based and mutation-based fuzzing, which have been studied for decades, while being fully automated, generalizable, and applicable to domains (such as DL systems) that are challenging for traditional approaches. We hope TitanFuzz will stimulate more work in this promising direction of using LLMs for fuzzing.
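
The generate-and-execute loop can be sketched as below; `llm_generate` is a hypothetical placeholder for a query to a code LLM such as Codex or InCoder, and the real TitanFuzz pipeline additionally mutates programs with infilling models and ranks candidates.

```python
# Minimal sketch of the generate-and-execute loop behind LLM-based library
# fuzzing. `llm_generate` is a hypothetical stand-in for a call to a code LLM;
# the real TitanFuzz pipeline also mutates programs with infilling models.

def llm_generate(prompt: str) -> str:
    # Placeholder: in practice this would query a code LLM with `prompt`.
    return "import torch\nx = torch.ones(2, 3)\ny = torch.nn.functional.relu(x)\nprint(y.shape)"

def run_candidate(code: str):
    """Execute a generated program and classify the outcome."""
    try:
        exec(compile(code, "<llm-candidate>", "exec"), {})
        return "ok"
    except Exception as e:                    # API misuse, shape errors, missing library, ...
        return f"exception: {type(e).__name__}"
    # A hard crash of the interpreter/library would be caught by the harness
    # supervising this process, not by this try/except.

prompt = "# Write a short PyTorch program exercising torch.nn.functional.relu\n"
for _ in range(3):
    candidate = llm_generate(prompt)
    print(run_candidate(candidate))
```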

Paper link: https://doi.org/10.1145/3597926.3598067

a9d228b7568ea09db0fdcd8e799c91b5.png

43、NodeRT: Detecting Races in Node.js Applications Practically

Node.js has become one of the most popular development platforms due to its excellent concurrency support. However, in Node.js applications, races caused by the non-deterministic execution order of event handlers can cause serious runtime failures. NRace, the state-of-the-art Node.js race detector, builds a happens-before (HB) graph using a set of HB relation rules before detection. During detection, NRace uses a heavyweight BFS-based algorithm to query reachability between resource operations, which introduces substantial overhead in practice and makes NRace impractical for real-world Node.js application testing. This article proposes a more practical dynamic race detection approach for Node.js, called NodeRT (Node.js Race Tracker). To reduce unnecessary overhead, NodeRT simplifies the HB relation rules and splits detection into three phases: trace collection, race candidate detection, and false positive removal. In the trace collection phase, NodeRT builds a partial HB graph called the asynchronous call tree (ACTree), enabling efficient reachability queries between event handlers. In the race candidate detection phase, NodeRT traverses the ACTree, efficiently eliminating most non-racing event handlers and outputting race candidates. In the false positive removal phase, NodeRT uses matching rules derived from the HB relation rules, together with resource characteristics, to reduce false positives among the race candidates. In our experiments, NodeRT detected all known races and 9 previously unknown harmful races, whereas NRace detected only 3 of the unknown harmful races while incurring on average 64x more time overhead. NodeRT's significantly lower overhead compared to NRace makes it a practical tool to integrate into real-world testing processes.

Paper link: https://doi.org/10.1145/3597926.3598139

c5183feafb977e4b70df0af42513b22b.png

44、OCFI: Make Function Entry Identification Hard Again

Function entry identification is a crucial and challenging task for binary disassemblers and has been a research focus over the past decades. Recent research has shown, however, that call frame information (CFI) provides accurate and almost complete function entry information, and with the help of CFI disassemblers have significantly improved function entry detection. CFI is designed for efficient stack unwinding, and on the x64 and aarch64 architectures every function has corresponding CFI. However, not every function and instruction actually unwinds the stack at runtime, an observation that opens the door to obfuscation techniques that make it harder for disassemblers to detect functions.

Based on this observation, we propose OCFI, a prototype that obfuscates CFI. The goal of OCFI is to hinder function detection in popular disassemblers that use CFI for function entry detection. We evaluate OCFI on a large-scale dataset of real applications and automatically generated programs, and find that the obfuscated CFI still unwinds the stack correctly while making function entry detection harder for popular disassemblers. Furthermore, on average, OCFI introduces only a 4% size overhead and almost no runtime overhead.

Paper link: https://doi.org/10.1145/3597926.3598097

d9844fd66e33e21629676ed3b376d560.png

45、Precise and Efficient Patch Presence Test for Android Applications against Code Obfuscation

Third-party libraries (TPLs) are widely used by Android developers to build new applications. However, TPLs often contain vulnerabilities that attackers can exploit, with potentially catastrophic consequences for application users. It is therefore crucial to test whether a vulnerability has been patched in a target application. However, existing techniques cannot effectively test patch presence in obfuscated applications, even though obfuscation is ubiquitous in practice. To address the new challenges introduced by code obfuscation, this study proposes PHunter, a system that captures obfuscation-resilient semantic features of patch-related methods to determine the presence of patches in target applications. Specifically, PHunter uses coarse-grained features to locate patch-related methods and compares fine-grained semantic similarity to determine whether the code has been patched. An extensive evaluation on 94 CVEs and 200 applications shows that PHunter outperforms existing tools, with an average accuracy of 97.1%, high efficiency, and a low false positive rate. Furthermore, PHunter is resilient to different obfuscation strategies. More importantly, PHunter can help eliminate the false positives produced by existing TPL detection tools, reducing false positives by up to 25.2% with an accuracy of 95.3%.

Paper link: https://doi.org/10.1145/3597926.3598061

e496b34e3e727b8f624b1129d2f3302c.png

46、Quantitative Policy Repair for Access Control on the Cloud

With the popularity of cloud computing, providing secure access to information stored in the cloud has become a critical issue. Due to the complexity of access control policies, administrators may inadvertently allow unintended access to private information, which is a common cause of data breaches in cloud services. This paper proposes a quantitative symbolic analysis approach for automatically repairing overly permissive access policies. We encode the semantics of access control policies as SMT formulas and use model counting to assess their permissiveness. Given a policy, a permissiveness bound, and a set of requests that should be allowed, we iteratively repair the policy through permissiveness reduction and refinement, so that it reaches the permissiveness bound while still allowing the given set of requests. We demonstrate the effectiveness of our automated policy repair technique by applying it to policies written in Amazon's AWS Identity and Access Management (IAM) policy language.
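
The repair loop can be illustrated with a toy sketch in which permissiveness is simply the number of allowed requests over a small enumerable request space, standing in for SMT-based model counting, and the "policy" is a list of (action, resource) pairs rather than real IAM syntax.

```python
# Sketch of the quantitative repair loop (toy version: permissiveness is the
# number of allowed requests over a small enumerable request space, standing
# in for SMT-based model counting over IAM policies).
from itertools import product

ACTIONS = ["GetObject", "PutObject", "DeleteObject"]
BUCKETS = ["public-site", "private-data"]
REQUESTS = list(product(ACTIONS, BUCKETS))

def allows(policy, request):
    action, bucket = request
    return any(a in ("*", action) and b in ("*", bucket) for a, b in policy)

def permissiveness(policy):
    return sum(allows(policy, r) for r in REQUESTS)     # model count over the toy domain

def repair(policy, must_allow, bound):
    """Iteratively tighten wildcard statements while keeping `must_allow` permitted."""
    while permissiveness(policy) > bound:
        for i, (a, b) in enumerate(policy):
            if a == "*" or b == "*":
                # Candidate refinement: replace the wildcard statement by the
                # concrete statements actually needed by `must_allow`.
                needed = [r for r in must_allow if allows([(a, b)], r)]
                candidate = policy[:i] + needed + policy[i + 1:]
                if all(allows(candidate, r) for r in must_allow):
                    policy = candidate
                    break
        else:
            break                                      # no further refinement possible
    return policy

overly_permissive = [("*", "public-site"), ("GetObject", "private-data")]
fixed = repair(overly_permissive, must_allow=[("GetObject", "public-site")], bound=2)
print(fixed, permissiveness(fixed))
```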

Paper link: https://doi.org/10.1145/3597926.3598078

48c2be1fd16fcdef8525bc660d11c775.png

47、Quantitative Symbolic Similarity Analysis

Similarity analysis plays a vital role in various software engineering tasks, such as detecting software changes, merging versions, identifying plagiarism, and analyzing binary code. Equivalence analysis is a more rigorous form of similarity that focuses on determining whether different programs, or different versions of the same program, behave the same. While a large body of research exists on code and binary similarity and equivalence analysis, quantitative reasoning is lacking in these areas. Non-equivalence in particular needs to be explored in depth, as it can manifest itself in different ways across the input domain. This article highlights the importance of quantitative reasoning about non-equivalence arising from semantic differences. By reasoning quantitatively about non-equivalence, one can determine over which specific input ranges programs are equivalent or non-equivalent. We aim to fill the gap in quantitative reasoning for symbolic similarity analysis to achieve a more comprehensive understanding of program behavior.

Paper link: https://doi.org/10.1145/3597926.3605238

503e372bddd1f8554d6da4dbaa2f5592.png

48、Rare Path Guided Fuzzing

Starting from random initial seeds, a fuzzer searches for inputs that trigger bugs or vulnerabilities. However, fuzzers often fail to generate inputs for program paths guarded by restrictive branch conditions. In this paper, we show that a fuzzer's performance can be improved by first identifying rare paths in a program (i.e., program paths whose path constraints are unlikely to be satisfied by random inputs) and then generating inputs/seeds that cover these rare paths. Specifically, we propose two techniques: 1) quantitative symbolic analysis to identify rare paths, and 2) path-guided concolic execution to generate inputs that exercise those rare paths. We feed these inputs as initial seed sets to three state-of-the-art fuzzers. Our experimental evaluation on a set of programs shows that fuzzers using the rare-path-based seed sets achieve better coverage than with random initial seeds.
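
The rare-path idea can be sketched by scoring a path as the product of the probabilities that a random input satisfies each of its branch conditions; the sketch below uses a toy byte-level input model and omits the concolic step that actually produces seeds for the rarest paths.

```python
# Toy sketch of ranking paths by how unlikely random inputs are to satisfy
# their branch conditions (the 'rare path' idea); the concolic step that then
# produces concrete seeds for the rarest paths is omitted.

DOMAIN = 256    # assume each input byte is uniform over 0..255

def branch_probability(kind, constant):
    """Probability that a uniformly random byte satisfies a branch condition."""
    if kind == "==":
        return 1 / DOMAIN
    if kind == "<":
        return constant / DOMAIN
    if kind == ">=":
        return (DOMAIN - constant) / DOMAIN
    raise ValueError(kind)

def path_probability(path):
    p = 1.0
    for kind, constant in path:
        p *= branch_probability(kind, constant)
    return p

paths = {
    "common": [("<", 200), (">=", 16)],                 # loose checks
    "rare":   [("==", 0x7F), ("==", 0x45), ("<", 4)],   # magic-byte style checks
}
ranked = sorted(paths, key=lambda name: path_probability(paths[name]))
for name in ranked:
    print(name, path_probability(paths[name]))          # rarest paths get seeds first
```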

Paper link: https://doi.org/10.1145/3597926.3598136

021f3ad4f9ced589a4d7c37001fb4a0e.png

49、Reasoning about MLIR Semantics through Effects and Handlers

MLIR is a novel framework for developing compiler intermediate representations (IRs). At its core, MLIR standardizes syntax fragments (dialects) and optimizations that can be combined on demand to form customized IRs, allowing IR abstractions to be shared across domains. With the rapid adoption of MLIR in industry, there is an urgent need for formal semantics techniques that match the flexibility and extensibility MLIR provides. We propose an MLIR semantics framework based on effects and handlers that can specify dialect semantics in a modular and composable manner, mirroring MLIR itself. We also describe several research directions for handler-based MLIR semantics.

Paper link: https://doi.org/10.1145/3597926.3605239

5f28e4ecc68c15055eefea635ec7b1b3.png

50、Reducing the Memory Footprint of IFDS-Based Data-Flow Analyses using Fine-Grained Garbage Collection

The IFDS algorithm requires large amounts of memory and computation for large programs, because it must store a large number of path edges in memory and process them until a fixed point is reached. In general, IFDS-based data-flow analyses, such as taint analysis, aim to discover data-flow facts only at certain program points. Maintaining a large number of path edges, many of which are accessed only once, wastes memory and reduces scalability and efficiency (due to frequent rehashing of the path-edge data structures). This paper introduces a fine-grained garbage collection (GC) algorithm that enables (multi-threaded) IFDS to reduce its memory footprint by removing inactive path edges (i.e., edges no longer needed to establish other path edges) from its path-edge data structure. The resulting IFDS algorithm, called FPC, preserves the correctness, precision, and termination properties of IFDS while avoiding re-processing GC'ed path edges (which may otherwise be regenerated in the presence of recursion) in later iterations of the analysis. Unlike CleanDroid, which uses a coarse-grained GC algorithm that collects path edges at the method level, FPC achieves fine-grained control by collecting path edges at the level of data facts. FPC can therefore collect more path edges than CleanDroid and incurs fewer rehashing operations on the path-edge data structures. In our evaluation, we apply IFDS-based taint analysis to a set of 28 Android applications. FPC scales to three applications that CleanDroid cannot finish within a 3-hour budget per application (due to running out of memory). For the remaining 25 applications, FPC reduces the number of path edges and the memory usage of CleanDroid by 4.4x and 1.4x, respectively, and on average runs 1.7x faster than CleanDroid (up to 18.5x in the best case).

Paper link: https://doi.org/10.1145/3597926.3598041

3fefee18d11a7484654678205746f1c2.png

51、Security Checking of Trigger-Action-Programming Smart Home Integrations

The Internet of Things (IoT) is widely used in various fields, especially home automation (HA). To better control HA-IoT devices, and in particular to integrate multiple devices into rich smart functionality, trigger-action programming, such as "If This Then That" (IFTTT), has become a popular paradigm. With this paradigm, novice users can easily specify in an applet how one device/service should be controlled via another when specific conditions are met. However, users may design IFTTT-style integrations inappropriately due to a lack of security experience or of awareness of the security implications of cyberattacks targeting individual devices. This can lead to financial loss, privacy leakage, unauthorized access, and other security issues. To address these issues, this paper proposes MEDIC, a system framework for modeling and security checking of smart home integrations. It automatically generates models that capture the behavior of services/devices and the action rules of applets, taking into account external attacks and vulnerabilities within devices. Our method takes roughly one second to model and check an integration. We created 200 integrations and conducted experiments based on a user study and a dataset scraped from ifttt.com. Surprisingly, nearly 83% of these integrations have security issues.

Paper link: https://doi.org/10.1145/3597926.3598084

8c97e3ea1b26c2b27d396020b79dbf07.png

52、Semantic-Based Neural Network Repair

Recently, neural networks have been widely adopted in many safety-critical systems and other domains. Neural networks are built (and trained) by programming in frameworks such as TensorFlow and PyTorch. Developers can apply a rich set of predefined layers to program neural networks manually, or generate them automatically through automated machine learning (AutoML). Composing neural networks from different layers is error-prone because of the non-trivial constraints that must be satisfied to use those layers. In this work, we propose a method to automatically repair faulty neural networks. The challenge is to identify the minimal modification to the network that makes it valid. Modifying one layer can have cascading effects on subsequent layers, so our approach searches recursively to identify a "globally" minimal modification. Our approach is based on the executable semantics of deep learning layers and focuses on four types of errors that are common in practice. We evaluate our approach in two usage scenarios: repairing common model errors in automatically generated neural networks and in manually written ones. The results show that we can effectively and efficiently repair 100% of a set of randomly generated neural networks (produced with existing AI framework testing methods), with an average repair time of 21.08 seconds, and 93.75% of a set of real-world faulty neural networks, with an average repair time of 3 minutes 40 seconds.
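
One class of errors the repair targets can be shown with a small PyTorch example (our own illustration): a Linear layer whose in_features does not match the flattened output of the preceding Conv2d, together with the obvious local fix; the paper's approach instead searches for a globally minimal modification using the executable semantics of each layer.

```python
# Sketch of one error class the repair targets: a Linear layer whose
# `in_features` does not match the flattened output of the previous layer.
# The 'repair' here is the obvious local fix; the paper searches for globally
# minimal modifications using the executable semantics of each layer.
import torch
import torch.nn as nn

def conv_flatten_size(conv, input_hw):
    """Flattened output size of a Conv2d for a given input height/width."""
    h = (input_hw[0] + 2 * conv.padding[0] - conv.kernel_size[0]) // conv.stride[0] + 1
    w = (input_hw[1] + 2 * conv.padding[1] - conv.kernel_size[1]) // conv.stride[1] + 1
    return conv.out_channels * h * w

conv = nn.Conv2d(3, 8, kernel_size=3)
broken_head = nn.Linear(100, 10)                 # wrong in_features: model fails at runtime

expected = conv_flatten_size(conv, (28, 28))     # 8 * 26 * 26 = 5408
if broken_head.in_features != expected:
    repaired_head = nn.Linear(expected, 10)      # minimal local modification
else:
    repaired_head = broken_head

x = torch.randn(1, 3, 28, 28)
out = repaired_head(torch.flatten(conv(x), 1))
print(out.shape)                                 # torch.Size([1, 10])
```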

Paper link: https://doi.org/10.1145/3597926.3598045

e147f628015bbed2ba745812abc74f15.png

53、SmartState: Detecting State-Reverting Vulnerabilities in Smart Contracts via Fine-Grained State-Dependency Analysis

Smart contracts written in Solidity are widely used on different blockchain platforms such as Ethereum, TRON and BNB Chain. A unique design in Solidity smart contracts is the state-reverting mechanism for error handling and access control. However, several recent security incidents have shown that attackers can also exploit this mechanism to manipulate the critical states of smart contracts, leading to security consequences such as illicit profit and denial of service (DoS). This article calls such a vulnerability a state-reverting vulnerability (SRV). Automatically identifying SRVs poses unique challenges, as it requires an in-depth analysis and understanding of the state dependencies in a smart contract.

This paper proposes SmartState, a new framework for detecting state-reverting vulnerabilities in Solidity smart contracts through fine-grained state-dependency analysis. SmartState integrates novel mechanisms to ensure its effectiveness. In particular, SmartState extracts state dependencies from both contract bytecode and historical transactions, which are both crucial for inferring SRV-related dependencies. Furthermore, SmartState models the common patterns of SRVs (i.e., profit gain and DoS) as SRV indicators and effectively identifies SRVs based on the constructed state-dependency graph. To evaluate SmartState, we manually annotated a benchmark dataset containing 91 real-world SRVs. The evaluation results show that SmartState achieves a precision of 87.23% and a recall of 89.13%. In addition, SmartState identified 406 new SRVs from 47,351 real-world smart contracts, 11 of which come from popular smart contracts with high transaction volume (i.e., in the top 2,000). Overall, the SRVs we reported affect digital assets worth a total of $428,600.

Paper link: https://doi.org/10.1145/3597926.3598111

ad36d36295377c746cbd23fba9589063.png

54、Splendor: Static Detection of Stored XSS in Modern Web Applications

In modern websites, stored cross-site scripting (XSS) is the most dangerous kind of XSS vulnerability: malicious code can be stored in the website's system and triggered directly when victims visit. As the most commonly used data storage medium for websites, the database (DB) is also where stored XSS occurs most often. Due to the modular nature of modern programming architectures, complex low-level database operations are usually encapsulated and abstracted into a data access layer (DAL) that provides unified data access services to the business layer. The widespread use of object-oriented (OO) and dynamic language features in this encapsulation makes it increasingly difficult for static taint analysis tools to track the flow of tainted data between the source code and the database. This article proposes a static analysis framework for detecting stored XSS in modern web applications that use a DAL, and implements a prototype PHP code analyzer called Splendor. The highlight of the framework is a heuristic yet precise token-matching method for locating the flow paths of tainted data between the database and the source code. The accuracy of the identified database read and write (R/W) locations is 91.3% and 82.6%, respectively. Through the identified R/W locations, the taint paths can be statically connected to obtain the complete taint propagation path of a stored XSS. A large-scale experimental comparison on five existing real-world applications and a PHP web application from GitHub shows that Splendor significantly outperforms state-of-the-art static and dynamic methods in stored-XSS detection, and it discovered 17 zero-day vulnerabilities.

Paper link: https://doi.org/10.1145/3597926.3598116

c27510564913fe6d98878f71d7fca417.png

55、SymRustC: A Hybrid Fuzzer for Rust

We present SymRustC, a hybrid fuzzer for Rust. SymRustC is hybrid in that it combines fuzzing with concolic execution. It leverages the concolic execution capabilities of an existing tool called SymCC and the fuzzing capabilities of another existing tool, LibAFL. Since SymCC instruments LLVM IR (intermediate representation) for concolic execution and the Rust compiler uses LLVM as a backend, we integrate SymCC with the Rust compiler to instrument Rust programs for concolic execution. LibAFL provides a framework for developing fuzzers, which we used to build a hybrid fuzzer that combines fuzzing with our concolic execution. We discuss our implementation along with four case studies demonstrating that SymRustC can generate inputs that find bugs in Rust programs.

Paper link: https://doi.org/10.1145/3597926.3604927

46665214c80ea1646355ed0b08e34228.png

56、Systematic Testing of the Data-Poisoning Robustness of KNN

Data poisoning aims to compromise machine-learning-based software components by contaminating the training set so as to alter the predictions on test inputs. Existing methods for determining robustness against data poisoning either have low accuracy or long running times and, more importantly, can only certify some of the truly robust cases: when certification fails they cannot draw a conclusion, i.e., they cannot falsify the truly non-robust cases. To overcome this limitation, we propose a systematic-testing-based approach that can both certify and falsify the data-poisoning robustness of the widely used supervised learning technique k-nearest neighbors (KNN). Our approach is faster and more accurate than the baseline enumeration method, thanks to a novel analysis in the abstract domain that quickly narrows the search space and systematic testing in the concrete domain that finds actual violations. We evaluate our method on a set of supervised learning datasets. The results show that it significantly outperforms state-of-the-art techniques and can decide the data-poisoning robustness of KNN predictions for most test inputs.
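
A toy concrete-domain check conveys the flavor of the approach; the sketch below brute-forces removal-based poisoning of a KNN prediction and omits the abstract-domain analysis that the paper uses to prune this search.

```python
# Toy concrete-domain check for removal-based data poisoning of KNN: the
# prediction on a test point is n-poisoning robust if no removal of up to n
# training points can flip the k-nearest-neighbor majority vote. (Only the
# brute-force concrete part is sketched, and only the removal threat model.)
from itertools import combinations
from collections import Counter
import numpy as np

def knn_label(X, y, x, k):
    order = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Counter(y[order]).most_common(1)[0][0]

def robust_to_removal(X, y, x, k, n):
    base = knn_label(X, y, x, k)
    # Only removals among the k+n nearest points can change the k nearest set.
    near = np.argsort(np.linalg.norm(X - x, axis=1))[: k + n]
    for r in range(1, n + 1):
        for removed in combinations(near, r):
            keep = np.ones(len(X), dtype=bool)
            keep[list(removed)] = False
            if knn_label(X[keep], y[keep], x, k) != base:
                return False          # found a poisoning that flips the prediction
    return True

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(robust_to_removal(X, y, x=np.array([1.5, 1.5]), k=5, n=2))
```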

Paper link: https://doi.org/10.1145/3597926.3598129

27efc7a44273f79658e8d6a32bad9b15.png

57、Tai-e: A Developer-Friendly Static Analysis Framework for Java by Harnessing the Good Designs of Classics

Static analysis is a mature field and is used in error detection, security analysis, program understanding, optimization, etc. To facilitate these applications, static analysis frameworks play an important role, providing a series of basic services such as intermediate representation (IR) generation, control flow graph construction, pointer/alias information calculation, etc. However, despite the tremendous progress in static analysis and the emergence of several well-known frameworks in the past few decades, these frameworks are not easy to learn and use for the developers who rely on them to create and implement analyses. In this sense, building a developer-friendly static analysis framework is not a trivial matter, since we have much less knowledge in designing and implementing static analysis frameworks than the knowledge required for static analysis itself.

In this work, we follow the principle of "harnessing the good designs of classics" and select design choices by discussing the trade-offs of the key components of a Java static analysis framework: for each key component, we compare the corresponding designs of classic frameworks (such as Soot, WALA, Doop, SpotBugs, and Checker) and choose the most suitable one; where no existing design is good enough, we propose a better one. These selected or newly proposed designs together constitute Tai-e, a brand-new Java static analysis framework implemented from scratch. Tai-e is innovative in many design aspects, such as its IR, pointer analysis, and support for developing new analyses, resulting in a developer-friendly (easy to learn and use) analysis framework. To the best of our knowledge, this is the first work to systematically explore the designs and implementations of various static analysis frameworks for Java. We hope it provides useful material and perspectives for building better static analysis infrastructure, and draws more community attention to this challenging but practical topic.

Paper link: https://doi.org/10.1145/3597926.3598120

8c62fa84ea7c225de150e098c038f649.png

58、Testing Automated Driving Systems by Breaking Many Laws Efficiently

Autonomous Driving Systems (ADS) serve as the brains of autonomous vehicles (AVs) and require comprehensive testing before deployment.

ADS must meet a complex set of regulations to ensure road safety, such as existing traffic regulations and future regulations that may specifically target AVs.

In order to comprehensively test ADS, we want to systematically discover the diverse scenarios in which specific traffic laws are violated. The challenge is that, on the one hand, there are many traffic laws (for example, there are 13 testable provisions in China's traffic laws and 16 in Singapore's, covering 81 and 43 types of violations, respectively); on the other hand, many traffic laws apply only in complex, specific scenarios.

Existing ADS testing methods either focus on simple rules (such as no collision) or have limited capabilities in generating diverse violation scenarios.

In this study, we propose ABLE, a new ADS testing method inspired by the success of GFlowNet, which aims to efficiently generate diverse scenarios that violate many different laws.

Unlike vanilla GFlowNet, ABLE dynamically updates its testing objectives, based on the robustness semantics of signal temporal logic and active learning, to effectively explore the vast search space.

Our evaluation of ABLE on Apollo and LGSVL shows that, when testing Apollo 6.0 and Apollo 7.0, ABLE violates 17% and 25% more laws, respectively, than the state of the art, most of them hard-to-violate laws.

Paper link: https://doi.org/10.1145/3597926.3598108

c3c92a059d0c68677509638bc0f85727.png

59、Testing the Compiler for a New-Born Programming Language: An Industrial Case Study (Experience Paper)

Due to the important role of compilers, many compiler testing techniques have been proposed, of which the two most notable categories are syntax-based and transformation-based techniques. These techniques have been extensively studied for testing mature compilers. In practice, however, it is often necessary to develop a new compiler for a newborn programming language. In this case, existing techniques are hard to apply, for several important reasons: (1) there is no reference compiler to support differential testing; (2) the program analysis tools required by most transformation-based compiler testing techniques are lacking; (3) a large amount of implementation effort is needed to handle different programming language features. Therefore, it is unclear how existing techniques would perform in this new scenario.

In this work, we conduct a first exploration (an industrial case study) of substantially adapting existing techniques to this new scenario. We adapt syntax-based compiler testing to this scenario by synthesizing new test programs from code snippets and using compiler crashes as the test oracle, since no reference compiler is available for differential testing. We also adapt transformation-based compiler testing to this scenario by constructing equivalent test programs, thereby alleviating the dependence on program analysis tools. We call the adapted techniques SynFuzz and MetaFuzz, respectively.

We evaluated SynFuzz and MetaFuzz on two versions of a new compiler for a newborn programming language at a global IT company. Compared with the testing practice adopted by the testing team and the general-purpose fuzzer AFL, SynFuzz detected more bugs within the same testing time, and both SynFuzz and MetaFuzz complement the other two techniques. In particular, SynFuzz and MetaFuzz together detected 11 previously unknown bugs, which have been fixed by the developers. Based on this industrial case study, we summarize a series of lessons learned and suggestions for practical application and future research.

Paper link: https://doi.org/10.1145/3597926.3598077

e0f2f45202baf415cef41e2da834ffb5.png

60、Third-Party Library Dependency for Large-Scale SCA in the C/C++ Ecosystem: How Far Are We?

Existing software composition analysis (SCA) techniques for the C/C++ ecosystem tend to identify reused components through feature matching between the target software project and collected third-party libraries (TPLs). However, feature duplication caused by internal code cloning can lead to inaccurate SCA results. To alleviate this problem, Centris, a state-of-the-art SCA technique for the C/C++ ecosystem, applies function-level code clone detection to derive TPL dependencies and eliminate redundant features before performing the SCA task. While Centris was shown to be effective in the original paper, the accuracy of the TPL dependencies it derives has not been evaluated, and datasets for assessing the impact of TPL dependencies on SCA are limited. To further investigate the effectiveness and limitations of Centris, we first construct two large-scale ground-truth datasets for evaluating the accuracy of the derived TPL dependencies and of the SCA results, respectively. We then evaluate Centris extensively; the results suggest that the accuracy of the TPL dependencies derived by Centris may not generalize to our evaluation dataset. We further infer that the key factors behind the performance degradation are inaccurate function birth times and the threshold-based recall mechanism. In addition, the benefit of Centris-derived TPL dependencies to SCA appears limited. Inspired by these findings, we propose TPLite, which adopts function-level origin-based TPL detection and graph-based dependency recall to enhance the accuracy of TPL reuse detection in the C/C++ ecosystem. Our evaluation shows that, compared to Centris, TPLite increases the precision of TPL dependencies from 35.71% to 88.33% and the recall from 49.44% to 62.65%. For SCA, compared with the SOTA academic SCA tool B2SFinder, TPLite raises the precision from 21.08% to 75.90% and the recall from 57.62% to 64.17%; it even outperforms the widely adopted commercial SCA tool BDBA, raising the precision from 72.46% to 75.90% and the recall from 58.55% to 64.17%.

Paper link: https://doi.org/10.1145/3597926.3598143

13a461ee9e4496545f2d0b2c9230b437.png

61、To Kill a Mutant: An Empirical Study of Mutation Testing Kills

Mutation testing has been used and studied for more than four decades as a method for assessing the strength of test suites. The technique injects an artificial defect (a mutation) into a program to generate a mutant, and then runs the test suite to determine whether any test case can detect the mutation (i.e., kill the mutant). A mutant is killed when a test case fails on it. However, we know little about the nature of these failures. This article presents an empirical study designed to explore them. We try to answer the following questions: How do test cases fail in ways that kill a mutant? For each killed mutant, how many test cases fail, given that a single failure is enough to kill it? How do program crashes contribute to killing mutants, and what are the origins and nature of these crashes? Across all experimental subjects we found several revealing results, including the significant contribution of crashes to the test failures that kill mutants, the diversity of the causes of test failures even for a single mutation, and the specific exception types that most often triggered crashes. We believe this study and its results should be considered by practitioners when using mutation testing and interpreting mutation scores, and by researchers when studying and applying mutation testing in future work.

Paper link: https://doi.org/10.1145/3597926.3598090

84b67889ba9ba9616a04ea509a5b0961.png

62、Toward Automated Detecting Unanticipated Price Feed in Smart Contract

Smart-contract-based decentralized finance (DeFi) reached over $200 billion in total value locked (TVL) in 2022. In the DeFi ecosystem, price oracles play a key role in providing real-time cryptocurrency price data to ensure accurate asset pricing in smart contracts. However, price oracles also face security issues, including the possibility of unexpected price feeds, which can lead to debt and asset imbalances in DeFi protocols. Moreover, existing solutions cannot effectively combine transactions and code to monitor price oracles in real time.

To address this limitation, we first classify price oracles into DON oracles, DEX oracles, and internal oracles according to their trusted parties, and analyze their security risks, data sources, price durations, and query fees. We then propose VeriOracle, a formal verification framework for automatically detecting unexpected price feeds in smart contracts. VeriOracle deploys a formal semantic model of the price oracle on the blockchain, detects the state of smart contracts in real time, and identifies transactions with unexpected price feeds. We applied VeriOracle to verify more than 500,000 real-world transactions across 13 vulnerable DeFi protocols. The experimental results show that: (1) VeriOracle is effective: it can detect unexpected price feeds before a DeFi attack happens (33,714 blocks ahead of the attacker in the best case); (2) VeriOracle is efficient: its verification time (about 4 seconds) is shorter than Ethereum's block time (about 14 seconds), meaning it can detect unsafe transactions in real time; and (3) VeriOracle is extensible and can be used to verify defense strategies: attacks using unexpected price feeds can only succeed in certain smart contract states, and VeriOracle can verify which smart contract states are resistant to such attacks.

Paper link: https://doi.org/10.1145/3597926.3598133

23af59f66f4d33997aca5c940ae09b3b.png

63、Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond

Recently, fine-tuning pre-trained code models such as CodeBERT has achieved great success on many software testing and analysis tasks. Although effective and prevalent, fine-tuning all pre-trained parameters incurs a large computational cost. Through an extensive experimental study, this paper explores how the layer-wise pre-trained representations and the code knowledge they encode change during fine-tuning. Based on these findings, we then propose efficient alternatives for fine-tuning large pre-trained code models. Our experimental study shows that: (1) lexical, syntactic and structural properties of source code are encoded in the lower, middle and higher layers, respectively, while semantic properties span the entire model; (2) the fine-tuning process preserves most of the code properties: the basic code properties captured by the lower and middle layers remain preserved during fine-tuning, and only the representations of the top two layers change the most across various downstream tasks; (3) based on these findings, we propose Telly, which efficiently fine-tunes pre-trained code models through layer freezing. Extensive experimental results on five downstream tasks show that the number of trained parameters and the corresponding time cost are greatly reduced, while performance is comparable or better.
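
Layer freezing itself is easy to sketch; the snippet below assumes a RoBERTa-style encoder such as CodeBERT loaded via Hugging Face transformers, and the number of frozen layers and the task head are illustrative choices, not Telly's exact configuration.

```python
# Sketch of fine-tuning via layer freezing (the idea behind Telly), assuming a
# RoBERTa-style encoder such as CodeBERT loaded with Hugging Face transformers.
# The number of frozen layers and the task head are illustrative choices.
import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")
FREEZE_BELOW = 10          # keep the lower/middle layers, which encode lexical and
                           # syntactic properties that fine-tuning barely changes

for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:FREEZE_BELOW]:
    for param in layer.parameters():
        param.requires_grad = False

head = nn.Linear(model.config.hidden_size, 2)     # e.g. a defect-detection head

trainable = [p for p in list(model.parameters()) + list(head.parameters())
             if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
# An optimizer would now be built over `trainable` only, cutting training cost.
```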

Paper link: https://doi.org/10.1145/3597926.3598036

69d90208e25e531d84be49fe62f1a339.png

64、TreeLine and SlackLine: Grammar-Based Performance Fuzzing on Coffee Break

TreeLine and SlackLine are grammar-based fuzzers for quickly discovering performance issues in programs driven by richly structured text described by a context-free grammar. Unlike lengthy fuzzing campaigns designed to find (mostly invalid) inputs that trigger security vulnerabilities, TreeLine and SlackLine search for performance problems with valid inputs in minutes rather than hours. The front ends of TreeLine and SlackLine differ in their search strategies (Monte Carlo tree search and derivation-tree splicing, respectively), but they accept the same grammar specification and rely on a common back end for instrumented execution. This separation of concerns should be convenient for other researchers who wish to explore alternatives or extensions on either the front end or the back end.

Paper link: https://doi.org/10.1145/3597926.3604925

3adfdeb3e4d7d6758d48fb4f9f3596cf.png

65、Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)

The complexity of software systems and the diversity of security vulnerabilities are well-recognized sources of ongoing challenges in software vulnerability research. Applying deep learning methods to automated vulnerability detection has proven an effective complement to traditional detection methods. However, the lack of high-quality benchmark datasets can severely limit the effectiveness of deep-learning-based vulnerability detection techniques. In particular, persistent label errors in existing vulnerability datasets can lead to inaccurate, biased, or even plainly wrong results. In this article, we aim to provide an in-depth understanding and explanation of the causes of label errors. To this end, we systematically analyze the diverse datasets used by state-of-the-art learning-based vulnerability detection approaches and examine the techniques they use to collect vulnerable source code. We find that label errors seriously affect mainstream vulnerability detection models, with the average F1 score dropping by 20.7% in the worst case. As mitigation, we introduce two dataset-denoising approaches that improve model performance by 10.4% on average. With these denoising approaches, we provide a feasible path to obtaining datasets with high-quality labels.

Paper link: https://doi.org/10.1145/3597926.3598037

1851932f7c63399ffe33ff2e88369d7c.png

66、Validating Multimedia Content Moderation Software via Semantic Fusion

The exponential growth of social media platforms such as Facebook, Instagram, YouTube and TikTok has fundamentally changed the way human society communicates and publishes content. On these platforms, users can publish multimedia content that conveys information through a combination of text, audio, images, and video. At the same time, the ease of publishing multimedia content is increasingly exploited to spread harmful content, such as hate speech, malicious advertising, and pornography. Content moderation software is therefore widely deployed on these platforms to detect and block harmful content. However, due to the complexity of content moderation models and the difficulty of understanding information across multiple modalities, existing content moderation software can miss harmful content, which often has extremely negative effects (for example, on adolescent mental health).

We introduce Semantic Fusion, a general and effective methodology for validating multimedia content moderation software. The key idea is to fuse two or more existing unimodal inputs (e.g., a text sentence and an image) into a new input that combines the semantics of its ancestors in a novel way and carries their harmful nature. This fused input is then used to validate the multimedia content moderation software. We implemented Semantic Fusion as DUO, a practical tool for testing content moderation software. In our evaluation, we used DUO to test five commercial content moderation systems and two state-of-the-art models against three kinds of harmful content. The results show that DUO achieves an error finding rate (EFR) of 100% when testing the moderation software and 94.1% when testing the state-of-the-art models. In addition, we used the test cases generated by DUO to retrain the two models we studied, which substantially improved their robustness (2.5%~5.7% EFR) while maintaining accuracy on the original test set.
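
One fusion operator can be sketched with Pillow: render a (placeholder) sentence onto an image so that the combined meaning only emerges across both modalities; DUO's actual operators and seed selection are richer, and real harmful seeds are obviously not reproduced here.

```python
# Minimal sketch of one semantic-fusion operator: embed a text sentence into an
# image so that the combined meaning only emerges across both modalities. DUO
# applies richer fusion operators to real (harmful) seed content; this toy uses
# placeholder text and a blank image.
from PIL import Image, ImageDraw

def fuse_text_into_image(sentence: str, image: Image.Image) -> Image.Image:
    """Render `sentence` onto `image`, yielding a multimodal test input."""
    fused = image.copy()
    draw = ImageDraw.Draw(fused)
    draw.text((10, 10), sentence, fill=(0, 0, 0))     # default PIL bitmap font
    return fused

seed_text = "placeholder sentence standing in for a unimodal seed"
seed_image = Image.new("RGB", (400, 100), color=(255, 255, 255))

test_case = fuse_text_into_image(seed_text, seed_image)
test_case.save("fused_test_case.png")
# The fused image is then sent to the content-moderation API under test; a
# miss on content that the unimodal seeds would have been flagged for counts
# as an error.
```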

Paper link: https://doi.org/10.1145/3597926.3598079

d1e8a7d52f52fd4a993b3873df9f246f.png

Origin blog.csdn.net/riusksk/article/details/132331718