ASE 2022, a top academic conference on software engineering: paper list, abstracts, and summary (network security direction)


Summary

Popular research directions

  1. Deep-learning-based vulnerability detection: Multiple papers explore how to use deep learning models to detect security vulnerabilities in code, such as VDet for Java and ReVulDL. Several others explore how to design more efficient, automated fuzzing techniques, such as reinforcement-learning-guided fuzzing for vulnerable transaction sequences and grammar-free DBMS fuzzing. This shows that vulnerability discovery based on deep learning is a hot research direction.

  2. Automated program repair: Many papers explore how to use machine learning and program analysis to automatically repair program defects, such as TransRepair and SelfAPR. This shows that automated program repair is a hot topic.

  3. Privacy protection in mobile applications: Several papers discuss user privacy in mobile augmented reality applications in detail; this is clearly a research area of great concern.

Less popular research directions

  1. Cross-ecosystem vulnerability impact analysis. For example, the Insight paper analyzes how vulnerabilities propagate between components across different software ecosystems.

  2. Detection of toxic communication between developers. For example, the ToxiCR paper discusses how to detect toxic comments made by developers during code reviews.

  3. Game security. For example, blockchain technology can be used to improve the security of game asset transactions.

  4. Autonomous driving test platforms: Paper 4 introduces ADEPT, a simulation-based testing platform for autonomous driving systems. There is relatively little research in this area.

Directions worthy of future research

  1. Explainability and trustworthiness of artificial intelligence systems. Several papers propose techniques for evaluating and improving the safety of autonomous driving systems, which shows that ensuring the safety and reliability of AI systems is an important direction. In addition, as deep learning technology matures, using deep learning for more effective software vulnerability detection will also be an important research direction.

  2. Privacy protection for mobile applications and IoT systems. For example, analyzing how mobile apps handle private data and detecting excessive access to sensitive resources. Protecting user privacy is an important direction.

  3. Development of practical security solutions. For example, integrating human effort into automated processes to develop lightweight, reliable security tools. Improving the practical effectiveness of security techniques is also critical.

Future security development trends and response suggestions

As technology develops, network security will rely more on automated and intelligent techniques such as deep learning and machine learning. Moreover, with the widespread adoption of new technologies such as the Internet of Things and cloud computing, network security will need to be protected at more levels and across a wider range of systems. The trend is that security must increasingly be considered together with other quality attributes and social factors, with overall reliability and trustworthiness continuously improved through innovation. Current security research needs to pay more attention to user needs, consider practical application scenarios, develop simple and practical techniques and tools, and strengthen exchange and cooperation with other research fields.


1、A Novel Coverage-guided Greybox Fuzzing based on Power Schedule Optimization with Time Complexity

Coverage-guided greybox fuzzing is considered a practical method for detecting software vulnerabilities, with the goal of maximizing code coverage. A common implementation allocates more energy to seeds that discover new edges in shorter execution times. However, considering only new edges can be ineffective, because complex program code often contains branches that are hard to reach. Code complexity is a key metric for assessing code security: code with higher complexity tends to contain more branches and cause more security issues than simply structured code. This paper proposes a novel fuzzing method that further exploits code complexity in AFL (American Fuzzy Lop) and AFLFast to optimize the energy scheduling process, with the goal of generating inputs biased toward the more complex parts of the program under test. A preliminary empirical study on three widely used real-world programs shows that the proposed method triggers more crashes and improves coverage discovery.
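
To make the power-schedule idea concrete, here is a minimal, hypothetical sketch in Python: a seed's mutation budget is scaled by the mean complexity of the branches it covers and penalized for slow executions. The `Seed` structure, the per-branch complexity map, and the constants are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Seed:
    data: bytes
    exec_time_ms: float
    covered_branches: set = field(default_factory=set)

def assign_energy(seed: Seed, branch_complexity: dict,
                  base: int = 16, cap: int = 1024) -> int:
    """Mutation budget grows with the mean complexity of the branches this
    seed covers (the paper's bias) and shrinks for slow seeds (AFL's bias)."""
    if seed.covered_branches:
        mean_cx = (sum(branch_complexity.get(b, 1) for b in seed.covered_branches)
                   / len(seed.covered_branches))
    else:
        mean_cx = 1.0
    energy = base * mean_cx / max(seed.exec_time_ms, 1.0)
    return max(1, min(cap, int(energy)))

# Example: a fast seed covering complex code gets a larger budget.
cx = {"b1": 12, "b2": 3}
print(assign_energy(Seed(b"A", 0.5, {"b1"}), cx))   # large budget
print(assign_energy(Seed(b"B", 20.0, {"b2"}), cx))  # small budget
```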

Paper link: https://doi.org/10.1145/3551349.3559550


2、A Study of User Privacy in Android Mobile AR Apps

With the development of augmented reality (AR) technology, mobile AR (MAR) applications are growing rapidly across many aspects of daily life, such as games, shopping, and education. Compared with traditional apps, AR apps typically require constant access to the smartphone's camera and collect and analyze more data, such as sensor data, geolocation, and biometric information. The sensitivity and volume of the data collected by MAR applications raise new privacy concerns. This article describes a preliminary empirical study of Android MAR applications, focusing on what sensitive data MAR applications collect, whether the collected data is well protected, and whether data practices are disclosed so that users can understand data safety and make informed decisions about which applications to install and use. In this study, we analyze 390 real-world MAR applications and report on the dangerous permissions they request, the data leaks we detect, and the availability of their data safety sections.

Paper link: https://doi.org/10.1145/3551349.3560512


3、A transformer-based IDE plugin for vulnerability detection

Automated vulnerability detection is critical to improving application security and should be performed early in the software development life cycle (SDLC) to reduce risk. Despite advances in state-of-the-art deep learning techniques for software vulnerability detection, development environments have yet to take full advantage of their performance. In this work, we build on the Transformer architecture, one of the major highlights of deep learning advances in natural language processing, to develop a code security tool for developers: VDet for Java, a Transformer-based VS Code extension that helps users discover vulnerabilities in Java files. Our preliminary model evaluation shows that multi-label classification achieves 98.9% accuracy and can detect up to 21 vulnerability types. A video demonstrating our tool can be found at https://youtu.be/OjiUBQ6TdqE, and the source code and dataset can be found at https://github.com/TQRG/VDET-for-Java.
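
For intuition, here is a hedged sketch of the inference path of such a Transformer-based multi-label classifier. The checkpoint name (`microsoft/codebert-base`) and the 0.5 decision threshold are assumptions for illustration; the actual tool ships its own fine-tuned weights and label set of 21 vulnerability types.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=21,                        # one logit per vulnerability type
    problem_type="multi_label_classification",
)

def detect(java_source: str, threshold: float = 0.5) -> list[int]:
    """Return the indices of vulnerability types predicted for a Java file."""
    inputs = tokenizer(java_source, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Multi-label: independent sigmoid per label, not a softmax.
    probs = torch.sigmoid(logits).squeeze(0)
    return [i for i, p in enumerate(probs.tolist()) if p >= threshold]
```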

Paper link: https://doi.org/10.1145/3551349.3559534


4、ADEPT: A Testing Platform for Simulated Autonomous Driving

Recently, effective quality assurance methods for autonomous driving systems (ADS) have attracted increasing attention. This paper reports on ADEPT, a new testing platform that aims to provide practical and comprehensive testing facilities for deep-neural-network-based ADS. ADEPT is built on the virtual simulator CARLA and provides many testing facilities, such as scenario construction, ADS import, and test execution and recording. In particular, ADEPT offers two unique strategies for generating autonomous driving test scenarios. First, we use real accident reports and apply natural language processing to create rich driving scenarios. Second, we take the ADS's feedback into account and synthesize physically robust adversarial attacks, enabling the generation of closed-loop test scenarios. Experiments confirm the effectiveness of the platform.

Paper link: https://doi.org/10.1145/3551349.3559528


5、AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models

The goal of pre-trained language models is to learn contextual representations of text data, and they have become mainstream in natural language processing and code modeling. Using probing techniques to study the linguistic properties of hidden vector spaces, previous research has shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, previous work did not evaluate whether these models encode the complete syntactic structure of programming languages. This paper demonstrates that the hidden representations of pre-trained language models contain a syntactic subspace that encodes the syntax of programming languages. We show that this subspace can be extracted from the model's representations, and we define a new probing method, AST-Probe, that can recover the complete abstract syntax tree (AST) of an input code fragment. Our experiments show that this syntactic subspace exists in five state-of-the-art pre-trained language models, and that the models' middle layers carry most of the AST information. Finally, we estimate the optimal size of the syntactic subspace and show that its dimensionality is much smaller than that of the models' representation space, suggesting that pre-trained language models use only a small portion of their representation space to encode the syntactic information of programming languages.
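
As a rough illustration of this family of probes, the sketch below trains a low-rank projection `B` so that distances between projected token vectors approximate pairwise distances in the AST, in the spirit of structural probing. The shapes, rank, and random training tensors are illustrative assumptions, not AST-Probe's actual training setup.

```python
import torch

hidden_dim, probe_rank = 768, 128           # rank << hidden_dim: a small subspace
B = torch.nn.Parameter(torch.randn(probe_rank, hidden_dim) * 0.01)
opt = torch.optim.Adam([B], lr=1e-3)

def probe_loss(hidden: torch.Tensor, tree_dist: torch.Tensor) -> torch.Tensor:
    """hidden: (n_tokens, hidden_dim) from one middle layer;
    tree_dist: (n_tokens, n_tokens) pairwise AST distances."""
    proj = hidden @ B.T                       # project into the candidate subspace
    diff = proj.unsqueeze(0) - proj.unsqueeze(1)
    pred = (diff ** 2).sum(-1)                # squared L2 distance per token pair
    return torch.abs(pred - tree_dist).mean()

# One toy optimization step on random tensors, just to show the loop:
h = torch.randn(10, hidden_dim)
d = torch.randint(0, 8, (10, 10)).float()
loss = probe_loss(h, d)
loss.backward()
opt.step()
```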

Paper link: https://doi.org/10.1145/3551349.3556900


6、ASTOR: An Approach to Identify Security Code Reviews

During code reviews, software developers often raise security concerns. Ignoring these concerns can severely harm the software product. This risk can be mitigated by having security experts perform an additional review, provided that code reviews raising security issues can be identified automatically. Therefore, the goal of this study is to develop an automated tool that identifies such security code reviews. To achieve this goal, I developed ASTOR, a method that combines two independent deep-learning-based classifiers, one over code review comments and one over the corresponding code context, and ensembles them with logistic regression. Under stratified ten-fold cross-validation, the best ensemble model achieves an F1 score of 79.8% and an accuracy of 88.4% in automatically identifying code reviews that raise security issues.
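
The ensembling step can be sketched as simple stacking: the two base classifiers' probabilities become features for a logistic regression meta-model. A minimal sketch, assuming the base classifiers' outputs are already available (the toy arrays below are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# p_comment[i], p_code[i]: probability that review i raises a security
# concern, as predicted by the comment-based and code-based classifiers.
p_comment = np.array([0.9, 0.2, 0.7, 0.1])
p_code    = np.array([0.8, 0.3, 0.4, 0.2])
y         = np.array([1, 0, 1, 0])           # ground-truth labels

# Meta-classifier combines the two probability streams.
X = np.column_stack([p_comment, p_code])
meta = LogisticRegression().fit(X, y)
print(meta.predict_proba(X)[:, 1])           # ensembled security probabilities
```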

Paper link: https://doi.org/10.1145/3551349.3559509


7、AUSERA: Automated Security Vulnerability Detection for Android Apps

To reduce the attack surface of application source code, many tools focus on detecting security vulnerabilities in Android applications. However, previous studies have highlighted some apparent weaknesses. For example, most of the available tools, such as AndroBugs, MobSF, Qark, and Super, use a pattern-based approach to detect security vulnerabilities. While they are effective at detecting certain types of vulnerabilities, they introduce a large number of false positives, inevitably increasing the remediation burden on application developers. Likewise, static taint analysis tools such as FlowDroid and IccTA report hundreds of data-leakage vulnerability candidates rather than confirmed vulnerabilities. Last but not least, the lack of a relatively complete vulnerability taxonomy leads to a large number of false negatives. In this paper, based on our prior knowledge in this research area, we empirically propose a vulnerability taxonomy as a baseline and then extend AUSERA's detection capability to cover 50 security vulnerability types. We also construct a new benchmark dataset containing these 50 vulnerability types to demonstrate the effectiveness of AUSERA. The tool and dataset can be found at https://github.com/tjusenchen/AUSERA, and a demo video can be found at https://youtu.be/UCiGwVaFPpY.

Paper link: https://doi.org/10.1145/3551349.3559524


8、An Empirical Study of Automation in Software Security Patch Management

Several studies have shown that automating the different activities that support the security patch management process has great potential to reduce delays in installing security patches. However, it is equally important to understand how automation is used in practical applications, its limitations in meeting real-world needs, and what practitioners really need, aspects that have not yet been empirically investigated in the existing software engineering literature. This article reports on an empirical study aimed at investigating different aspects of security patch management automation through semi-structured interviews with 17 practitioners from three different organizations in the healthcare sector. The findings focus on the role of automation in security patch management to provide insights into the current state of automation in practical applications, the limitations of current automation, how automation support can be enhanced to effectively meet practitioner needs, and the role of humans in automated processes. Based on the findings, we provide a series of recommendations to guide future efforts to develop automated support for security patch management.

Paper link: https://doi.org/10.1145/3551349.3556969


9、Are they Toeing the Line? Diagnosing Privacy Compliance Violations among Browser Extensions

Browser extensions are integrated features in modern browsers designed to enhance the online browsing experience. Their vantage point between users and the Internet allows them to easily access users' sensitive data, which raises growing privacy concerns from legislators and extension users. In this study, we propose an end-to-end approach to automatically diagnose privacy compliance violations in extensions. It analyzes whether privacy policies comply with regulatory requirements and examines actual privacy-related practices at runtime. This approach can serve as an efficient and practical privacy compliance violation detection mechanism for extension users, developers, and app store operators. Our approach leverages the state-of-the-art language model BERT to annotate policy text and employs hybrid techniques to analyze an extension's source code and runtime behavior. To facilitate model training, we construct a corpus named PrivAud-100, which contains 100 manually annotated privacy policies. Our large-scale diagnostic assessment shows that the vast majority of existing extensions suffer from privacy noncompliance: about 92% have at least one violation in their privacy policy or data collection practices. Based on our findings, we further propose a metric that filters and identifies privacy-noncompliant extensions with high accuracy (over 90%). Our work should raise awareness among extension users, service providers, and platform operators and encourage them to implement solutions that better satisfy privacy compliance requirements. To facilitate future research in this area, we have released our dataset, corpus, and analyzer.

Paper link: https://doi.org/10.1145/3551349.3560436


10、Assessment of Automated (Intelligent) Toolchains


[Background:] Automated intelligent toolchains are composed of different tools that use artificial intelligence or static analysis; they are widely deployed for automated program repair in software engineering and for vulnerability identification in software security. [Overall research question:] Most studies on automated intelligent toolchains only report the uncertainty and evaluation of individual components in the chain. How do we compute the uncertainty and error propagation of the toolchain as a whole? [Method:] I plan to replicate published case studies to collect data and devise a method to reconstruct an overall correctness measure for the toolchain or to identify missing variables. Further confirmatory experiments with human participants will be conducted. Finally, I will implement a tool that automatically evaluates an entire automated toolchain. [Current status:] Preliminary validation against published studies has been conducted, and the results are promising.

Paper link: https://doi.org/10.1145/3551349.3559572


11、Augur: Dynamic Taint Analysis for Asynchronous JavaScript

Dynamic taint analysis (DTA) is a popular method for helping protect JavaScript applications from injection vulnerabilities. In 2016, the ECMAScript 7 JavaScript language standard introduced many language features that most existing JavaScript DTA tools do not support, such as the async/await keywords for asynchronous programming. We propose Augur, a high-performance dynamic taint analysis for ES7 JavaScript that leverages the virtual machine's built-in instrumentation support. By integrating directly with a public, stable instrumentation API, Augur runs inside the virtual machine with high performance and remains resilient to language changes. We extend the taint-analysis abstractions to handle asynchronous function calls. Beyond the traditional DTA use case of detecting injection vulnerabilities, Augur is highly configurable and supports any kind of taint analysis, making it useful outside the security domain as well. We evaluated Augur on a set of 20 benchmarks and observed a median runtime overhead of only 1.77x, a median performance improvement of 298% over the previous state of the art.
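
The core DTA idea of source, propagation, and sink, including across an async boundary, can be modeled in a few lines. This is a hedged Python toy (Augur itself instruments JavaScript inside the VM); all names here are invented for illustration:

```python
import asyncio

class Tainted(str):
    """A string value marked as attacker-controlled (the taint source)."""

def concat(a, b):
    # Propagation rule: the result is tainted if either operand is.
    out = str.__add__(str(a), str(b))
    return Tainted(out) if isinstance(a, Tainted) or isinstance(b, Tainted) else out

def sink(query):
    # Sink rule: tainted data must never reach the SQL engine.
    if isinstance(query, Tainted):
        raise RuntimeError("tainted value reached SQL sink")

async def handler(user_input: str):
    q = concat("SELECT * FROM t WHERE id=", Tainted(user_input))
    await asyncio.sleep(0)        # the taint label must survive the async hop
    sink(q)

try:
    asyncio.run(handler("1 OR 1=1"))
except RuntimeError as e:
    print("detected:", e)
```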

Paper link: https://doi.org/10.1145/3551349.3559522


12、Auto Off-Target: Enabling Thorough and Scalable Testing for Complex Software Systems

Billions of people rely every day on software systems built as operating system kernels, basebands, bootloaders, firmware, IoT devices, or automotive software. As these systems grow ever more complex, and since they are often written in unsafe languages such as C/C++, testing them is critical. However, testing such complex systems poses significant challenges, such as custom hardware without emulators or non-trivial setup for testing and debugging on the target device, so commonly used testing techniques and tools are not always easy to apply. Off-target (OT) testing is a promising technique for addressing these challenges: parts of the code are extracted and adapted to run on a different hardware platform with better tool support, easier debugging, and higher test throughput. Unfortunately, because creating OT programs has been a manual process, the technique does not scale well and is mostly used in ad hoc ways. In this paper, we propose Auto Off-Target (AoT), a new approach to testing complex systems. Based on information extracted from the source code and the build process, AoT automatically generates OT programs in C. AoT goes beyond code generation and provides mechanisms that help recreate and discover program state in the OT code. The generated OT programs are self-contained and independent of the original build environment. As a result, portions of complex or embedded software can easily be run, analyzed, debugged, and tested on a standard x86_64 machine. We evaluated AoT on tens of thousands of functions selected from operating system kernels, a bootloader, and a network stack, and demonstrate that we can run fuzzing and symbolic execution on most of the generated OT programs. We also used AoT in a vulnerability discovery campaign and discovered seven vulnerabilities in the Android redfin and oriole kernels that power the Google Pixel 5 and 6 phones.

Paper link: https://doi.org/10.1145/3551349.3556915


13、Automated Identification of Security-Relevant Configuration Settings Using NLP

To protect our computing infrastructure, we must configure all security-relevant settings, which requires security experts to identify them, a process that is time-consuming and expensive. Our proposed solution leverages state-of-the-art natural language processing to classify settings as security-relevant based on their descriptions. Our evaluation shows that the trained classifier does not perform well enough to replace human security experts, but it can help them classify settings. By releasing our labeled dataset and the code for training the model, we hope to assist security experts in analyzing configuration settings and to stimulate further research in this area.

Paper link: https://doi.org/10.1145/3551349.3559499


14、Automatic Software Timing Attack Evaluation & Mitigation on Clear Hardware Assumption

Embedded systems are widely used to implement Internet of Things (IoT) applications, which often involve secret or sensitive data and encryption keys that may leak through timing side channels. Runtime-based timing side-channel attacks measure how long code takes to execute and use that information to extract sensitive data. Because a program's runtime depends on the underlying hardware, detecting such vulnerabilities with high accuracy and few false positives is challenging. Because embedded systems are so diverse, fixing such vulnerabilities effectively and with low overhead is also non-trivial. In this paper, we propose an automated framework for detecting and mitigating runtime side-channel vulnerabilities that considers not only the software code but also the underlying hardware architecture, adapting itself to a specific system to achieve more accurate vulnerability detection and customized mitigation.
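
The measurement side of such an attack is easy to demonstrate. The sketch below times an early-exit comparison under two input classes and shows that the medians differ; it is a minimal illustration of the statistical idea only, not the paper's hardware-aware framework, and the secret and iteration count are made up:

```python
import statistics
import time

def naive_check(secret: bytes, guess: bytes) -> bool:
    # Early-exit comparison: runtime depends on the matching prefix length.
    for a, b in zip(secret, guess):
        if a != b:
            return False
    return len(secret) == len(guess)

def median_ns(secret: bytes, guess: bytes, n: int = 2000) -> float:
    samples = []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        naive_check(secret, guess)
        samples.append(time.perf_counter_ns() - t0)
    return statistics.median(samples)

secret = b"S3cretKey!"
print("wrong first byte:", median_ns(secret, b"X" * 10))
print("matching prefix :", median_ns(secret, b"S3cretKe!!"))  # usually slower
```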

Paper link: https://doi.org/10.1145/3551349.3559516


15、CBMC-SSM: Bounded Model Checking of C Programs with Symbolic Shadow Memory

Dynamic program analysis tools such as Eraser, TaintCheck, or ThreadSanitizer abstract the contents of individual memory locations and store the abstractions in a separate data structure called shadow memory, then leverage this meta-information to efficiently implement their analyses. This paper describes an efficient symbolic shadow memory extension for the CBMC bounded model checker, accessible via an API, and outlines its application in the design of a new data race analyzer implemented via code transformation. Tool link: https://doi.org/10.5281/zenodo.7026604 Video link: https://youtu.be/pqlbyiY5BLU

Paper link: https://doi.org/10.1145/3551349.3559523


16、CoditT5: Pretraining for Source Code and Natural Language Editing

Pre-trained language models have proven effective in many software-related generation tasks; however, they are not well suited to editing tasks, because they were not designed to reason about edits. To address this problem, we propose a novel pre-training objective that explicitly models edits, and we use it to build CoditT5, a large language model for software-related editing tasks that is pre-trained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming standard generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show that a standard generation model and our edit-based model complement each other through a simple reranking strategy, with which we achieve state-of-the-art performance on all three downstream editing tasks.

Paper link: https://doi.org/10.1145/3551349.3556955


17、Compiler Testing using Template Java Programs

We present JAttack, a template-based testing framework for compilers. With JAttack, developers write a template program that describes a set of programs to be generated and fed to the compiler as test inputs. Such a framework lets developers incorporate their domain knowledge of the compiler under test by providing a basic program structure, enabling the exploration of complex programs that can trigger sophisticated compiler optimizations. A developer writes a template program in the host language (Java) containing holes to be filled by JAttack. Each hole, written in a domain-specific language, constructs a node in an extended abstract syntax tree (eAST) that defines the hole's search space, i.e., a set of expressions and values. JAttack generates programs by executing the template and randomly choosing expressions and values within the search spaces defined by the holes. Furthermore, we introduce several optimizations to reduce JAttack's generation cost. While JAttack can be used to test various compiler features, we demonstrate its ability to help test just-in-time (JIT) Java compilers, whose optimizations occur at runtime after a sufficient number of executions. Using JAttack, we discovered six critical bugs confirmed by Oracle developers. Four of them were previously unknown, including two unknown CVEs (Common Vulnerabilities and Exposures). JAttack shows how combining developers' domain knowledge (via templates) with random testing can detect bugs in the JIT compiler.

Paper link: https://doi.org/10.1145/3551349.3556958


18、Cornucopia: A Framework for Feedback Guided Generation of Binaries

Binary analysis is an important capability required by many security and software engineering applications, and many binary analysis techniques and tools exist with varying capabilities. However, testing these tools requires a large, diverse set of binaries with corresponding source-level information. This paper introduces Cornucopia, an architecture-agnostic automation framework that leverages compiler optimizations and feedback-guided learning to generate large numbers of binaries from corresponding program source code. Our evaluation shows that Cornucopia generated 309K binaries across four architectures (x86, x64, ARM, MIPS), an average of 403 binaries per program, and outperforms the similar technique BinTuner [53]. Our experiments revealed an issue with LLVM's optimization scheduler that causes the compiler to crash (about 300 times). We evaluated four popular binary analysis tools, angr, Ghidra, IDA, and radare, using the binaries generated by Cornucopia and discovered various issues with these tools: specifically, 263 crashes in angr and a memory corruption issue in IDA. Our differential testing of the analysis results revealed various semantic errors in these tools. We also tested the machine learning tools Asm2Vec, SAFE, and Debin, which claim to capture binary semantics, and found that they perform poorly on Cornucopia-generated binaries (e.g., Debin's F1 score dropped from the reported 63.1% to 12.9%). Overall, our exhaustive evaluation shows that Cornucopia is an effective mechanism for efficiently generating binaries for testing binary analysis techniques.

Paper link: https://doi.org/10.1145/3551349.3561152


19、Coverage-based Greybox Fuzzing with Pointer Monitoring for C Programs

C has long been the main programming language for implementing system software. However, because of its low-level memory control, C programs often suffer from memory vulnerabilities. Many methods have been proposed to enhance memory safety, among which coverage-based greybox fuzzing (CGF) is very popular due to its practicality and satisfactory results. However, CGF identifies vulnerabilities through captured crashes and therefore cannot detect non-crashing vulnerabilities. This article proposes tracking pointer metadata (state, bounds, and references) to detect additional vulnerabilities. Furthermore, since pointers in C are directly related to memory operations, we design two criteria that further leverage pointer metadata to guide CGF, steering the fuzzing process toward the vulnerable parts of the program.

Paper link: https://doi.org/10.1145/3551349.3559566


20、CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code

In recent years, there has been a surge of work that predicts source code, such as code completion, code migration, program repair, and translating natural language into code. A challenge all of these works face is how to evaluate the quality of predictions, usually by comparison with a reference solution. A common evaluation metric is the BLEU score, an n-gram-based metric originally designed to evaluate natural language translation that has been adopted in the software engineering community because it can be easily computed for any programming language and enables automated evaluation at scale. However, a key difference between natural and programming languages is that in programming languages, completely unrelated pieces of code may share many n-grams simply because of the verbose syntax and coding conventions of the language. We observe that these meaningless shared n-grams hinder the metric's ability to distinguish truly similar code examples from those merely written in the same language. This article proposes CrystalBLEU, a BLEU-based evaluation metric that measures code similarity accurately and efficiently. Our metric retains the desirable properties of BLEU, such as being language-agnostic, handling incomplete or partially incorrect code, and being efficient to compute, while reducing the noise caused by meaningless shared n-grams. We evaluate CrystalBLEU on two datasets from previous work and on a new dataset of labeled equivalent programs. The results show that CrystalBLEU distinguishes similar from dissimilar code examples 1.9-4.5x more effectively than the original BLEU score and a previously proposed code-specific BLEU variant.
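
The core idea can be sketched in a few lines: estimate the most frequent ("trivially shared") n-grams from a corpus, then compute n-gram precision after discarding them. This is a hedged simplification; the real metric also applies BLEU's brevity penalty and a geometric mean over several n-gram sizes, and the cutoff `k` here is an arbitrary illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trivially_shared(corpus, n=2, k=50):
    """The k most frequent n-grams in a token corpus, e.g. ('(', ')')."""
    counts = Counter(g for toks in corpus for g in ngrams(toks, n))
    return {g for g, _ in counts.most_common(k)}

def crystal_precision(candidate, reference, shared, n=2):
    """n-gram precision computed only over non-trivial n-grams."""
    cand = Counter(g for g in ngrams(candidate, n) if g not in shared)
    ref = Counter(g for g in ngrams(reference, n) if g not in shared)
    overlap = sum((cand & ref).values())   # clipped counts, as in BLEU
    return overlap / max(1, sum(cand.values()))
```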

Paper link: https://doi.org/10.1145/3551349.3556903


21、Dancing, not Wrestling: Moving from Compliance to Concordance for Secure Software Development

In recent years, secure software development has become an increasingly important research focus, closely tied to advances in technologies such as artificial intelligence and machine learning (AI/ML) and robotics and autonomous systems (RAS). AI/ML and RAS facilitate automated decision-making and can significantly affect society, so the technology needs to be trustworthy, and secure software development is one of the key attributes of trust. Software developers are often responsible and accountable for delivering secure code, yet they frequently lack decision-making authority over how security is achieved. That authority often rests with cybersecurity professionals, who dictate security processes, tools, and training, often with limited success. The goal of our research is to better understand how to bridge the gap between software developers and cybersecurity practitioners so that decision-making authority, responsibility, and accountability are shared equally. We draw inspiration from the study of compliance, adherence, and concordance in healthcare. We offer this study as a perspective, analyzing qualitative data from 35 professional software developers. Our research shows that when software developers and cybersecurity professionals achieve concordance in their interactions, it can lead to negotiation of more realistic cybersecurity solutions and eliminate developer frustration, ultimately producing more secure, trustworthy systems.

Paper link: https://doi.org/10.1145/3551349.3561145


22、Data Leakage in Notebooks: Static Detection and Better Processes

The pipelines used in data science to train and evaluate machine learning models can be buggy just like any other code. Leakage between training and test data leads to overestimating model accuracy during offline evaluation, which can result in low-quality models being deployed to production. Such leakage easily arises from mistakes or poor practices, but detecting it manually is tedious and challenging. We developed a static analysis for detecting common forms of data leakage in data science code. Our evaluation shows that the analysis accurately detects data leakage and that such leakage is pervasive among the more than 100,000 public notebooks we analyzed. We discuss how our static analysis approach can help practitioners and educators, and how leakage prevention can be designed into the development process.
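
To make one common leakage pattern concrete, here is a hedged sketch of a static check: flag `fit`/`fit_transform` calls that occur before any `train_test_split` call in a script. The paper's analysis tracks data flow and covers more patterns; this only scans call order, and the helper names are invented:

```python
import ast

LEAKY_BEFORE_SPLIT = {"fit", "fit_transform"}

def find_leakage(source: str) -> list[int]:
    """Return line numbers of fit calls that happen before the split."""
    tree = ast.parse(source)
    calls = [(node.lineno,
              node.func.attr if isinstance(node.func, ast.Attribute)
              else getattr(node.func, "id", ""))
             for node in ast.walk(tree) if isinstance(node, ast.Call)]
    split_lines = [ln for ln, name in calls if name == "train_test_split"]
    first_split = min(split_lines, default=float("inf"))
    return [ln for ln, name in calls
            if name in LEAKY_BEFORE_SPLIT and ln < first_split]

# The scaler sees the test data before the split: a classic leak.
print(find_leakage(
    "scaler.fit_transform(X)\n"
    "X_train, X_test, y_train, y_test = train_test_split(X, y)\n"))  # [1]
```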

Paper link: https://doi.org/10.1145/3551349.3556918


23、Detecting Semantic Code Clones by Building AST-based Markov Chains Model

Code clone detection aims to find functionally similar code fragments and is increasingly important in software engineering. Many code clone detection methods have been proposed, among which tree-based methods can handle semantic code clones. However, these methods are hard to scale to large code bases due to the complexity of tree structures. In this paper, we design Amain, a scalable tree-based semantic code clone detector, by building a Markov chain model. Specifically, we propose a new method that converts the original complex trees into simple Markov chains and measures the distances between all states in these chains. After obtaining all distance values, we feed them into a machine learning classifier to train a code clone detector. To test the effectiveness of Amain, we evaluate it on two widely used datasets, Google Code Jam and BigCloneBench. Experimental results show that Amain outperforms nine state-of-the-art code clone detection tools (SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, FCCA, DeepSim, and SCDetector).
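
A hedged sketch of the core transformation, using Python's own `ast` module for brevity (the tool targets other languages, and its state definition, distance measure, and classifier step differ): each AST becomes a Markov chain over node types, and two fragments are compared by the distance between their transition probabilities.

```python
import ast
from collections import defaultdict

def markov_chain(source: str) -> dict:
    """Parent-node-type -> child-node-type transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            counts[type(parent).__name__][type(child).__name__] += 1
    probs = defaultdict(lambda: defaultdict(float))
    for p, row in counts.items():
        total = sum(row.values())
        for c, n in row.items():
            probs[p][c] = n / total
    return probs

def chain_distance(a: dict, b: dict) -> float:
    states = set(a) | set(b)
    return sum(abs(a[p][c] - b[p][c])
               for p in states for c in set(a[p]) | set(b[p]))

# Structurally identical functions yield distance 0.0 despite renaming.
print(chain_distance(markov_chain("def f(x):\n    return x + 1\n"),
                     markov_chain("def g(y):\n    return y + 2\n")))
```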

Paper link: https://doi.org/10.1145/3551349.3560426


24、Do Regional Variations Affect the CAPTCHA User Experience? A Comparison of CAPTCHAs in China and the United States

Global systems use CAPTCHA as a security mechanism to protect against unauthorized automated access. Typically, the effectiveness of a CAPTCHA is evaluated based on its ability to fight against bots. User perceptions of the interactive experience and effectiveness of CAPTCHAs have received less attention, especially comparing the CAPTCHA variants presented in different regions around the world. As a first step to fill this gap, we conducted semi-structured interviews with ten participants fluent in both Chinese and English to investigate whether user perceptions are affected by CAPTCHA variants presented in China and the United States. We found significant differences in user experience and effectiveness across CAPTCHA types, but not between regional variants of the same type. Our findings point to multiple ways to make the CAPTCHA user experience more universal and inclusive.

Paper link: https://doi.org/10.1145/3551349.3561146


25、Effectively Generating Vulnerable Transaction Sequences in Smart Contracts with Reinforcement Learning-guided Fuzzing

As computer programs running on the blockchain, smart contracts are widely used in many decentralized applications, but they also bring security vulnerabilities that can lead to huge financial losses, so detecting smart contract vulnerabilities is critical and urgent. However, existing smart contract fuzzers still cannot efficiently detect complex vulnerabilities that are only triggered by a specific sequence of vulnerable transactions. To address this challenge, we propose RLF, an innovative reinforcement-learning-based, vulnerability-guided fuzzer that generates vulnerable transaction sequences to detect such complex vulnerabilities in smart contracts. Specifically, we first model the smart contract fuzzing process as a Markov decision process to construct our reinforcement learning framework. We then design a reward that accounts for both vulnerabilities and code coverage, effectively guiding the fuzzer to generate the specific transaction sequences that reveal vulnerabilities, especially those involving multiple functions. We conduct extensive experiments to evaluate RLF's performance; the results show that RLF outperforms existing vulnerability detection tools within 30 minutes (e.g., detecting 8% to 69% more vulnerabilities).
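
The reward shaping described above can be sketched with a tabular Q-learning update: the fuzzer gets credit both for new coverage and for hitting vulnerability-relevant states, so transaction sequences drift toward exploitable paths. The weights and update rule below are illustrative assumptions, not RLF's actual formulation:

```python
from collections import defaultdict

W_COV, W_VULN = 1.0, 10.0     # coverage vs. vulnerability weighting (made up)
ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount factor

Q = defaultdict(float)        # Q[(state, action)], action = next transaction

def reward(new_edges: int, vuln_signals: int) -> float:
    # Both terms matter: pure coverage alone misses sequence-dependent bugs.
    return W_COV * new_edges + W_VULN * vuln_signals

def q_update(state, action, next_state, actions, new_edges, vuln_signals):
    r = reward(new_edges, vuln_signals)
    best_next = max((Q[(next_state, a)] for a in actions), default=0.0)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```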

Paper link: https://doi.org/10.1145/3551349.3560429


26、Efficient Greybox Fuzzing to Detect Memory Errors

Greybox fuzzing is a proven method for detecting security vulnerabilities and other bugs in modern software systems, and it can be combined with memory-error sanitizers (such as AddressSanitizer) to further improve the detection of certain classes of bugs, such as buffer overflows and use-after-free errors. However, sanitizers introduce additional performance overhead that can degrade fuzzing performance; for example, fuzzing with ASAN can suffer a 2.36x slowdown, partially negating the benefit of using a sanitizer. Recent research attributes most of the extra overhead to program startup/teardown costs, which can dominate fork-mode fuzzing. In this paper, we present a new memory-error sanitizer design optimized specifically for fork-mode fuzzing. The basic idea is to mark object boundaries with random tokens rather than the disjoint metadata used in traditional sanitizer designs. All reads and writes are then instrumented with code that checks for the token; if the token is found, a memory error is detected. Because our design uses no disjoint metadata, it is very lightweight, meaning program startup and teardown costs are minimized to the benefit of fork-mode fuzzing. We implemented our design as a tool and demonstrate a fuzzing performance overhead of only 1.14-1.27x, depending on the configuration.
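
The token idea can be illustrated on a toy flat heap: a random marker is placed at each object's end, and every access checks for it, so no separate shadow region is needed. This is a hedged Python model of the concept only; the real design instruments compiled C/C++ loads and stores:

```python
import secrets

TOKEN = secrets.token_bytes(8)          # random boundary marker

class ToyHeap:
    def __init__(self, size=1 << 16):
        self.mem = bytearray(size)
        self.top = 0

    def malloc(self, n: int) -> int:
        addr = self.top
        self.mem[addr + n: addr + n + 8] = TOKEN   # mark the object's end
        self.top += n + 8
        return addr

    def load(self, addr: int) -> int:
        # The access check: hitting the token means an out-of-bounds read.
        if bytes(self.mem[addr: addr + 8]) == TOKEN:
            raise MemoryError(f"out-of-bounds access at {addr:#x}")
        return self.mem[addr]

heap = ToyHeap()
p = heap.malloc(16)
heap.load(p + 15)                # in bounds: fine
try:
    heap.load(p + 16)            # reads the token: overflow detected
except MemoryError as e:
    print("detected:", e)
```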

Paper link: https://doi.org/10.1145/3551349.3561161


27、Empirical Study of System Resources Abused by IoT Attackers

In recent years, Internet of Things (IoT) devices have been attacked frequently, with serious impact. Previous research has characterized the evolution of specific IoT malware families or stages of IoT attacks through offline sample analysis. However, we still lack systematic observation of the various system resources abused by active attackers and the malicious intent behind those actions, which makes it difficult to design protection strategies that address both existing attacks and possible future variants. In this paper, we fill this gap by analyzing 117,862 valid attack sessions captured by HoneyAsclepius, our purpose-built, highly interactive IoT honeypot, and uncovering attacker intent with our designed workflow. HoneyAsclepius enables high capture capability and continuous behavioral monitoring during live attack sessions. Through a large-scale deployment, we collected 11,301,239 malicious behaviors from 50,594 distinct attackers. Based on this information, we further separate the behaviors in different attack sessions by the categories of system resources they target, estimate the temporal relationships among them, and summarize their malicious intent. Building on these investigations, we present several key insights into the abuse of file, network, process, and special-capability resources and propose practical defense strategies to better protect IoT devices.

Paper link: https://doi.org/10.1145/3551349.3556901


28、Enriching Compiler Testing with Real Program from Bug Report

Researchers have proposed various methods for generating test programs. State-of-the-art methods can be broadly divided into generation-based and mutation-based approaches: the former generate random programs, while the latter produce more test programs by mutating existing ones. Both mainly produce random code, but using real programs is more beneficial, because the impact of the compiler bugs they trigger is easier to appreciate and their mix of valid and invalid code is justified. However, most real programs from code repositories are not good at triggering compiler bugs, partly because they were compiled before being committed. In this experience paper, we apply two techniques, differential testing and code snippet extraction, to the specific research area of compiler testing. Based on our observations of compiler testing practice, we treat compiler bug reports as a new source of test programs. To illustrate the benefits of this new source, we implemented a tool called LeRe that extracts test programs from bug reports and uses differential testing to detect compiler bugs with the extracted programs. With the enriched test programs, we found 156 unique bugs in the latest versions of gcc and clang; 103 have been confirmed as valid, and 9 have been fixed. The confirmed bugs include 59 accept-invalid bugs (accepting invalid programs) and 33 reject-valid bugs (rejecting valid programs), two categories that previous methods rarely reported and that this new source lets us detect. We also analyze our invalid bug reports. These results are useful to programmers migrating from one compiler to another and offer insights for researchers applying differential testing to detect bugs in other kinds of software.
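
The differential-testing step can be sketched as a small driver: compile an extracted program with two compilers and flag disagreement (a crash, or one compiler accepting while the other rejects). A hedged sketch with assumed compiler paths and flags, not LeRe's actual harness:

```python
import subprocess
from typing import Optional

def compile_with(compiler: str, source_file: str) -> subprocess.CompletedProcess:
    # -fsyntax-only: just parse and type-check, no output file needed.
    return subprocess.run([compiler, "-fsyntax-only", source_file],
                          capture_output=True, text=True, timeout=30)

def differential_test(source_file: str) -> Optional[str]:
    gcc = compile_with("gcc", source_file)
    clang = compile_with("clang", source_file)
    if gcc.returncode < 0 or clang.returncode < 0:
        return "crash"                       # a compiler died on a signal
    if (gcc.returncode == 0) != (clang.returncode == 0):
        # One accepts, the other rejects: a candidate accept-invalid or
        # reject-valid bug in one of the two compilers.
        return "accept/reject disagreement"
    return None                              # both agree: no signal
```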

Paper link: https://doi.org/10.1145/3551349.3556894


29、Finding Property Violations through Network Falsification: Challenges, Adaptations and Lessons Learned from OpenPilot

OpenPilot is an open-source system that assists drivers with features such as automatic lane centering and adaptive cruise control. Like most autonomous driving systems, OpenPilot relies on a complex deep neural network (DNN) to provide its functionality, and such networks are prone to safety issues that can lead to accidents. To uncover such potential issues before deployment, we study falsification, a targeted testing method that analyzes a DNN to generate inputs that cause safety violations. Specifically, we explore applying state-of-the-art falsifiers to the DNN used in OpenPilot, which reflects recent trends in network design. Our study reveals the challenges of applying such falsifiers to real-world DNNs, describes our engineering efforts to overcome them, and demonstrates the falsifier's potential to detect property violations and provide meaningful counterexamples. Finally, we summarize the lessons learned and the outstanding challenges falsifiers must address to realize their potential on systems like OpenPilot.

Paper link: https://doi.org/10.1145/3551349.3559500


30、FuzzerAid: Grouping Fuzzed Crashes Based On Fault Signatures

Fuzzing has long been an important method for discovering vulnerabilities and bugs in programs. Many industrial fuzzers run every day and can generate a large number of crashes, and diagnosing these crashes can be very challenging and time-consuming. Existing fuzzers typically use heuristics, such as code coverage or call stack hashes, to weed out duplicate vulnerability reports. Although these heuristics are cheap, they are often imprecise, so many "unique" crashes corresponding to the same vulnerability are still reported. This article introduces FuzzerAid, an approach that uses fault signatures to group the crashes reported by fuzzers. A fault signature is a small executable program consisting of the necessary statements selected from the original program that can reproduce the vulnerability. In our approach, we first generate a fault signature from a given crash. We then execute the fault signature with other crash-inducing inputs: if the failure is reproduced, we classify the crash into the group of that fault signature; if not, we generate a new fault signature. After classifying all crash-inducing inputs, we further merge fault signatures with the same root cause into one group. We implemented our method in a tool called FuzzerAid and evaluated it on 15 real vulnerabilities and 3,020 crashes generated from 4 large open-source projects. The evaluation shows that we correctly group 99.1% of the crashes and report only 17 (+2) "unique" vulnerabilities, outperforming the current state of the art.

Paper link: https://doi.org/10.1145/3551349.3556959


31、Fuzzle: Making a Puzzle for Fuzzers

With the rapid development of fuzzing, the demand for automatically synthesizing buggy programs keeps increasing. Previous approaches mainly inject bugs into existing programs, so the generated programs may contain unintended bugs and thus fail to provide reliable benchmarks. In this paper, we address this challenge by casting the bug synthesis problem as a maze generation problem. Specifically, we synthesize complete buggy programs by encoding sequences of moves in a maze as chains of function calls. By design, our approach provides synthetic benchmarks with precise ground truth. Furthermore, it allows generating benchmarks with realistic path constraints extracted from existing vulnerabilities. We implement our idea in a tool called Fuzzle and evaluate it with five state-of-the-art fuzzers to empirically demonstrate its value.
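
To illustrate the encoding, here is a hedged toy generator: each maze step becomes a C function with a guard on one input byte, so only the input that walks the solution path reaches the planted bug. The guard scheme and templates are invented for illustration, not Fuzzle's exact encoding:

```python
def maze_to_c(moves: str) -> str:
    """moves: the solution path, e.g. "RRDL"; byte i of the fuzzer's
    input must equal moves[i] to take step i toward the bug."""
    funcs = []
    for i, step in enumerate(moves):
        callee = f"cell_{i + 1}" if i + 1 < len(moves) else "bug"
        funcs.append(
            f"void cell_{i}(const char *in) {{\n"
            f"    if (in[{i}] == '{step}') {callee}(in);\n"
            f"}}\n")
    return ("#include <stdlib.h>\n"
            "void bug(const char *in) { abort(); }\n"
            + "\n".join(reversed(funcs))   # define callees before callers
            + "\nint main(int argc, char **argv) {"
              " if (argc > 1) cell_0(argv[1]); return 0; }\n")

print(maze_to_c("RRDL"))   # emits a self-contained buggy C program
```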

Paper link: https://doi.org/10.1145/3551349.3556908


32、GLITCH: Automated Polyglot Security Smell Detection in Infrastructure as Code

Infrastructure as Code (IaC) is the process of managing IT infrastructure through programmable configuration files (IaC scripts). Like other software artifacts, IaC scripts may contain security smells, coding patterns that can lead to security weaknesses. Automated analysis tools exist to detect security smells in IaC scripts, but each focuses on a single technology such as Puppet, Ansible, or Chef. This means that when a new security smell detection is implemented in one tool, it does not immediately carry over to the technologies supported by other tools; the only option is to duplicate the effort. This paper proposes an approach that enables consistent security smell detection across different IaC technologies. We conduct a large-scale empirical study analyzing security smells in three large datasets containing 196,755 IaC scripts and 12,281,251 lines of code. We show that all categories of security smells can be identified in all datasets, and we identify some smells that may affect many IaC projects. To conduct this study, we developed GLITCH, a new technology-agnostic framework that enables automated polyglot security smell detection by converting IaC scripts into an intermediate representation on which different security smell detectors can be defined. GLITCH currently supports detecting nine security smells in scripts written in Ansible, Chef, or Puppet. We compared GLITCH with state-of-the-art security smell detectors; the results show not only that GLITCH reduces the effort of writing security smell analyses for multiple IaC technologies, but also that it achieves higher precision and recall than the current state-of-the-art tools.

Paper link: https://doi.org/10.1145/3551349.3556945


33、Generalizability of Code Clone Detection on CodeBERT

Transformer networks such as CodeBERT achieve outstanding code clone detection results on benchmark datasets, so one might think this task is solved. However, code clone detection is not trivial; semantic code clones, in particular, are challenging. We demonstrate CodeBERT's reduced generalization ability by evaluating Java code clones from two different subsets of BigCloneBench. We observe a significant drop in F1 score when evaluating on code snippets and functionality IDs different from those used to build the model.

Paper link: https://doi.org/10.1145/3551349.3561165


34、Generating Critical Test Scenarios for Autonomous Driving Systems via Influential Behavior Patterns

Autonomous driving systems (ADSs) are safety-critical and must be thoroughly tested before deployment on real roads. Generating diverse safety-critical scenarios is crucial for comprehensively evaluating ADS performance. Most existing studies evaluate ADSs either by searching a high-dimensional input space or by using simple, predefined test scenarios, which is either inefficient or inadequate. To better test ADSs, this paper proposes a method that automatically generates safety-critical test scenarios by mining influential behavior patterns from real traffic trajectories. Based on these influential behavior patterns, we propose CRISCO, a novel scenario generation technique for ADS testing. CRISCO generates diverse test scenarios by solving trajectory constraints and makes non-critical scenarios more challenging by progressively adding actor behaviors drawn from the influential behavior patterns. We demonstrate CRISCO on the industrial-grade ADS platform Baidu Apollo. Experimental results show that our method effectively and efficiently generates critical scenarios that crash the ADS, exposing 13 different types of safety violations within 12 hours; on the same roads, it also exposes 5 more types of safety violations than two state-of-the-art ADS testing techniques.

Paper link: https://doi.org/10.1145/3551349.3560430


35、Griffin : Grammar-Free DBMS Fuzzing

Fuzzing is a promising approach to testing database management systems (DBMSs). In DBMS fuzzing, a grammar is a crucial component: because a DBMS strictly validates its input, a grammar improves fuzzing efficiency by generating syntactically and semantically correct SQL statements. However, because the complex grammars of different DBMSs differ enormously, tailoring these fuzzers to each DBMS is very time-consuming. Given that many DBMSs have not been thoroughly tested, an effective DBMS fuzzing method that does not rely on a grammar is urgently needed. In this paper, we propose Griffin, a mutation-based, grammar-free DBMS fuzzer. Instead of relying on a grammar, Griffin summarizes the DBMS's state into a metadata graph, a lightweight data structure that improves mutation correctness during fuzzing. Specifically, it first tracks the metadata of statements while executing the DBMS's built-in SQL test cases and iteratively builds a metadata graph that describes the dependencies between metadata and statements. Based on these graphs, it shuffles statements and corrects semantic errors with metadata-guided substitution. We evaluated Griffin on MariaDB, SQLite, PostgreSQL, and DuckDB. Griffin found 27, 27, and 22 more bugs in 12 hours than SQLancer, SQLsmith, and Squirrel, respectively, and covered 73.43%-274.70%, 80.47%-312.89%, and 43.80%-199.11% more branches. Overall, Griffin found 55 previously unknown bugs and was assigned 13 CVE identifiers.

Paper link: https://doi.org/10.1145/3551349.3560431


36、HTFuzz: Heap Operation Sequence Sensitive Fuzzing

Heap-based temporal vulnerabilities (such as use-after-free, double-free, and null pointer dereference) are very sensitive to the sequence of heap operations (memory allocation, deallocation, and access). To discover such vulnerabilities effectively, traditional code-coverage-guided fuzzing can be improved by integrating heap operation sequence feedback, but current sequence-sensitive solutions have limitations in practice. This paper proposes HTFuzz, a new fuzzing solution for discovering heap-based temporal vulnerabilities. At its core, we use fuzzing to increase the runtime coverage of heap operation sequences and the diversity of the pointers accessed by those operations, the former reflecting the control flow and the latter the data flow of heap operation sequences. With this increase, the fuzzer can discover more heap-based temporal vulnerabilities. We developed a prototype of HTFuzz and evaluated it on 14 real-world applications against 11 state-of-the-art fuzzers. The results show that HTFuzz outperforms all baselines and is statistically better in the number of heap-based temporal vulnerabilities found. Specifically, compared with AFL, AFL-sensitive-ma, AFL-sensitive-mw, Memlock, PathAFL, TortoiseFuzz, MOPT, Angora, and Ankou, HTFuzz found 1.82x, 2.62x, 2.66x, 2.02x, 2.21x, 2.06x, 1.47x, 2.98x, and 1.98x more heap operation sequences, and 1.45x, 3.56x, 3.56x, 4.57x, 1.78x, 1.78x, 1.68x, 4.00x, and 1.45x more 0-day heap-based temporal vulnerabilities, respectively. In total, HTFuzz discovered 37 new vulnerabilities that were assigned 37 CVE numbers, including 32 new heap-based temporal vulnerabilities and 5 vulnerabilities of other types.

Paper link: https://doi.org/10.1145/3551349.3560415


37、ICEBAR: Feedback-Driven Iterative Repair of Alloy Specifications

Automated program repair (APR) techniques have achieved great success in automatically finding fixes for programs written in languages such as C or Java. In this work, we focus on repairing formal specifications, specifically in the Alloy specification language. Unlike most APR tools, our approach to repairing Alloy specifications, called ICEBAR, does not use test-based oracles for patch evaluation. Instead, ICEBAR relies on property-based oracles, which typically appear in Alloy specifications as predicates and assertions. These property-based oracles impose stronger conditions on patch evaluation, thereby reducing the notorious overfitting problem commonly observed in APR settings that use test-based oracles. Furthermore, because assertions and predicates are intrinsic to Alloy while test cases are not, our tool is more appealing to Alloy users than test-based Alloy repair tools. At a high level, ICEBAR is an iterative, counterexample-based process that generates and verifies repair candidates. ICEBAR receives a faulty Alloy specification and a failing property-based oracle, builds tests from Alloy counterexamples, and feeds them to ARepair, a test-based Alloy repair tool, to generate repair candidates. Each candidate is then checked against the property-based oracle for overfitting: if the candidate passes, a repair has been found; if not, further counterexamples are generated to build tests and strengthen the test suite, and the process iterates. ICEBAR includes different mechanisms, of varying reliability, for generating counterexamples from failing predicates and assertions. Our evaluation shows that ICEBAR significantly outperforms ARepair in reducing overfitting and improving the repair rate. Moreover, ICEBAR shows that iterative refinement significantly improves on the state of the art in automated repair of Alloy specifications, without requiring any modification to the underlying repair tool.

Paper link: https://doi.org/10.1145/3551349.3556944

38、Identification and Mitigation of Toxic Communications Among Open Source Software Developers

Toxic and unhealthy conversations among developers can reduce the professionalism and productivity of Free and Open Source Software (FOSS) projects. For example, a toxic code review comment may cause the author of a proposed change to push back, and toxic communication can hinder future communication and collaboration. Research also shows that toxicity disproportionately affects newcomers, women, and participants from other marginalized groups; toxicity is therefore a barrier to promoting diversity, equity, and inclusion. Since toxic communication is not uncommon in FOSS communities and can have serious consequences, the main goal of my proposed PhD thesis is to automatically identify and mitigate toxicity in developers' text-based communications. To achieve this goal, I intend to: i) build an automatic toxicity detector applicable to the Software Engineering (SE) domain, ii) define the concept of toxicity across different populations, and iii) analyze the impact of toxicity on Open Source Software (OSS) project outcomes.

Paper link: https://doi.org/10.1145/3551349.3559570

39、Identifying Sexism and Misogyny in Pull Request Comments

Software development organizations are highly gender-skewed and lack diversity, and people from underrepresented groups often encounter sexism, misogyny, and discriminatory comments during interactions. To identify such content, I aim to build an Automatic Misogyny Identification (AMI) tool for the software development community. To this end, I mined a dataset of 10,138 pull request comments on GitHub using keyword selection and manual verification. Using ten-fold cross-validation, I evaluated ten machine learning algorithms for automatic misogyny identification. The best-performing model achieved 80% precision, 67.07% recall, a 72.5% F1 score, and 95.96% accuracy.
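
For readers unfamiliar with the setup, ten-fold cross-validation of a text classifier looks roughly like this in scikit-learn; the toy comments and the logistic-regression model are placeholders, not the paper's dataset or its best-performing algorithm.

```python
# A minimal sketch of the evaluation protocol described above. The data
# below is synthetic filler, not the 10,138-comment dataset from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

comments = ["great work, thanks", "this is terrible, go away"] * 50
labels   = [0, 1] * 50  # 1 = discriminatory content (toy labeling)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_validate(clf, comments, labels, cv=10,
                        scoring=["precision", "recall", "f1", "accuracy"])
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```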

Paper link: https://doi.org/10.1145/3551349.3559515

40、Insight: Exploring Cross-Ecosystem Vulnerability Impacts

So-called CLV issues are vulnerabilities caused by cross-language calls to vulnerable libraries. Since Python/Java projects make extensive use of C libraries, these issues greatly enlarge their attack surface. Existing build tools in the PyPI and Maven ecosystems fail to report dependencies on vulnerable libraries written in other languages such as C, so it is easy for developers to overlook CLV issues. This paper presents the first empirical study of the current status of CLV issues in the PyPI and Maven ecosystems. The study found that 82,951 projects in these ecosystems depend directly or transitively on libraries compiled from C project versions identified as vulnerable in CVE reports. Our study draws attention to the CLV issue in popular ecosystems and presents the relevant analysis results. It also led to the development of Insight, the first automated mechanism providing a turnkey solution for identifying CLV issues in PyPI and Maven projects based on published CVE reports of vulnerable C projects. Insight automatically determines whether a PyPI or Maven project uses a C library compiled from a vulnerable C project version, and infers the vulnerable APIs involved by analyzing the project's usage of foreign function interfaces such as CFFI and JNI. Insight achieves a high detection rate of 88.4% on a benchmark of popular CLV issues. As a contribution to the open source community, we reported 226 CLV issues detected in actively maintained PyPI and Maven projects that directly depend on vulnerable C library versions. Our reports were well received by developers, who raised questions about Insight's usability; 127 reported issues (56.2%) were quickly confirmed by developers, and 74.8% of those have been or are being fixed in popular projects such as MongoDB and Eclipse/SUMO.
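
Conceptually, the core check reduces to matching a project's native-library dependencies against vulnerable version ranges drawn from CVE reports. A minimal sketch follows, with a made-up advisory format and helper names (not Insight's actual data model):

```python
# Illustrative CLV check: flag a project when one of its (transitive)
# native dependencies falls inside a CVE's vulnerable version range.
def parse(v):                       # "1.2.3" -> (1, 2, 3)
    return tuple(int(x) for x in v.split("."))

advisories = {                      # C library -> vulnerable version ranges (toy data)
    "libxml2": [("2.0.0", "2.9.10")],
}

def vulnerable(lib, version):
    return any(parse(lo) <= parse(version) <= parse(hi)
               for lo, hi in advisories.get(lib, []))

project_native_deps = {"libxml2": "2.9.4", "zlib": "1.2.13"}
for lib, ver in project_native_deps.items():
    if vulnerable(lib, ver):
        print(f"CLV issue: depends on {lib} {ver} (vulnerable range)")
```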

Paper link: https://doi.org/10.1145/3551349.3556921

41、Is this Change the Answer to that Problem?: Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness

In recent years, patch correctness has become a focus of automated program repair (APR), because APR tools tend to generate overfitting patches. Given a generated patch, it is often difficult for a validator (such as a test suite) to determine its correctness. The literature therefore proposes various methods to further assess the correctness of APR-generated patches, leveraging engineered or deep-learned features or exploring dynamic execution information. In this work, we propose a new perspective on patch correctness assessment: a correct patch implements an "answer" to the question posed by the buggy behavior. Specifically, we turn patch correctness assessment into a question-answering problem. Our intuition is that natural language processing can provide the necessary representations and models for assessing the semantic correlation between a bug report (the question) and a patch (the answer). Concretely, we take as input bug reports and natural language descriptions of generated patches. Our approach, named Quatrain, first uses a state-of-the-art commit message generation model to produce a description for each generated patch. We then use a neural network architecture to learn the semantic correlation between bug reports and commit messages. Experiments on a large dataset of 9,135 patches generated for three defect datasets (Defects4J, Bugs.jar, and Bears) show that Quatrain achieves an AUC of 0.886 in predicting patch correctness, recalling 93% of correct patches while filtering out 62% of incorrect patches. Our experiments further demonstrate the influence of input quality on prediction performance, and additional experiments highlight that the model indeed learns the relationship between bug reports and code change descriptions. Finally, we compare against prior work and discuss the advantages of our approach.
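
The question-answering intuition can be illustrated with a crude relevance scorer: given a bug report and candidate patch descriptions, score how well each description "answers" the report. Quatrain learns this correlation with a neural matcher; the TF-IDF cosine similarity below is only a stand-in for the idea, and the texts are invented.

```python
# Toy relevance scoring between a bug report (question) and patch
# descriptions (answers); a stand-in for Quatrain's learned matcher.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bug_report = "NullPointerException when parsing an empty configuration file"
patch_descriptions = [
    "add a null check before reading the configuration entries",  # plausible fix
    "rename internal variable for readability",                   # likely overfitting
]
vec = TfidfVectorizer().fit([bug_report] + patch_descriptions)
scores = cosine_similarity(vec.transform([bug_report]),
                           vec.transform(patch_descriptions))[0]
for desc, s in zip(patch_descriptions, scores):
    print(f"{s:.2f}  {desc}")
```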

Paper link: https://doi.org/10.1145/3551349.3556914

42、Keeping Secrets: Multi-objective Genetic Improvement for Detecting and Reducing Information Leakage

Information leaks in software can inadvertently expose private data, yet they are difficult to detect and fix. Although several methods for detecting leaks have been proposed, such as those based on static verification, they require expertise and are time-consuming. Recently, we introduced HyperGI, a dynamic, hypertest-based approach that detects and generates potential fixes for hyperproperty violations. In particular, we focus on violations of the noninterference property, since they can lead to information flow leakage. Our instantiation of HyperGI was able to detect and reduce leaks in three small programs. Its fitness function attempts to balance information leakage against program correctness, but, as we pointed out, maintaining program semantics while reducing information leakage may require developers to make trade-offs. In this work, we ask whether it is possible to automatically detect and repair information leaks in more realistic programs without requiring specialized knowledge. We instantiate a multi-objective version of HyperGI in a tool called LeakReducer, which explicitly encodes the trade-off between program correctness and information leakage. We apply LeakReducer to six leaky programs, including the well-known Heartbleed vulnerability. LeakReducer detects leaks in all of them, whereas state-of-the-art fuzzers detect leaks in only two. Furthermore, LeakReducer is able to reduce leaks in all tested programs, with results comparable to previous work, while scaling to larger software.

Paper link: https://doi.org/10.1145/3551349.3556947

43、LawBreaker: An Approach for Specifying Traffic Laws and Fuzzing Autonomous Vehicles

Autonomous Driving Systems (ADSs) must be thoroughly tested before deployment in autonomous vehicles. High-fidelity simulators allow ADSs to be tested in a wide variety of scenarios, including ones that are difficult to reproduce on physical test sites. Although previous methods can automatically generate test cases, they tend to focus on weak oracles (e.g., reaching the destination without collision) without assessing whether the journey itself was safe and compliant with the law. In this work, we propose LawBreaker, an automated framework for testing ADSs against real-world traffic laws that is compatible with different scenario description languages. LawBreaker provides a rich driver-oriented specification language for describing traffic laws, and a fuzzing engine that searches for different ways of violating them by maximizing specification coverage. To evaluate our approach, we implemented it for Apollo+LGSVL and specified Chinese traffic laws. LawBreaker was able to find 14 violations of these laws, including 173 test cases that caused accidents.
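
As an illustration of checking a driver-oriented rule against a simulated trace, consider the toy check below; the signal names and the single red-light rule are invented for this sketch, and LawBreaker's specification language and coverage-guided fuzzing are far richer.

```python
# Toy traffic-rule check over a simulated trace: "do not cross the stop
# line while the light is red". Field names are assumptions, not the
# LawBreaker DSL; a fuzzer would mutate scenarios to maximize rule coverage.
trace = [
    {"light": "green", "speed": 10.0, "past_stop_line": True},
    {"light": "red",   "speed": 4.2,  "past_stop_line": True},   # runs the light
    {"light": "red",   "speed": 0.0,  "past_stop_line": False},  # waits correctly
]
violations = [i for i, s in enumerate(trace)
              if s["light"] == "red" and s["past_stop_line"] and s["speed"] > 0]
print("violating steps:", violations)
```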

Paper link: https://doi.org/10.1145/3551349.3556897

44、Leveraging Artificial Intelligence on Binary Code Comprehension

Understanding binary code is an important and complex software engineering task in reverse engineering, malware analysis, and compiler optimization. Unlike source code, binary code carries limited semantic information, which makes it challenging for humans to understand. At the same time, compiling source code to binary code, or translating between programming languages, can introduce external knowledge into binary code comprehension. We propose to develop artificial intelligence (AI) models to help humans comprehend binary code. Specifically, we propose to incorporate domain knowledge from large source code corpora (e.g., variable names, comments) into AI models that capture general-purpose representations of binary code. Finally, we will investigate model evaluation metrics suitable for binary code through human comprehension studies.

Paper link: https://doi.org/10.1145/3551349.3559564

45、MalWhiteout: Reducing Label Errors in Android Malware Detection

In recent years, machine learning-based Android malware detection has attracted substantial research effort. A reliable malware dataset is crucial for evaluating the effectiveness of malware detection approaches. However, existing malware datasets in our community are mainly annotated by leveraging anti-virus services such as VirusTotal, an approach prone to mislabeling, which in turn leads to inaccurate assessments of malware detection techniques. Removing label noise from large-scale Android malware datasets is quite challenging. To address this problem, we propose an effective approach called MalWhiteout to reduce label errors in Android malware datasets. Specifically, we creatively introduce Confident Learning (CL), an advanced noise estimation method, into the field of Android malware detection. To counter the false positives introduced by CL, we combine ensemble learning with inter-app relations to achieve more robust noise detection. We evaluate MalWhiteout on a carefully curated, large-scale, reliable benchmark dataset. Experimental results show that MalWhiteout detects label noise with over 94% accuracy even at a high noise ratio (30%). Across different settings, MalWhiteout outperforms the state-of-the-art approach in both effectiveness (8% to 218% improvement) and efficiency (70 to 249 times faster). By reducing label noise, we show that the performance of existing malware detection approaches can be improved.
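
The Confident Learning step can be approximated in a few lines: obtain out-of-fold predicted probabilities via cross-validation, then flag samples whose confidence in their assigned label falls below the class's average self-confidence. This simplified criterion and the synthetic data are illustrative only, not MalWhiteout's full pipeline.

```python
# Simplified confident-learning sketch on synthetic data: suspects are
# samples whose out-of-fold probability for their given label is below
# the per-class self-confidence threshold.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y_true = (X[:, 0] > 0).astype(int)
y_noisy = y_true.copy()
flip = rng.choice(200, size=20, replace=False)   # inject 10% label noise
y_noisy[flip] ^= 1

proba = cross_val_predict(RandomForestClassifier(random_state=0),
                          X, y_noisy, cv=5, method="predict_proba")
# per-class threshold: mean confidence over samples carrying that label
thresholds = np.array([proba[y_noisy == c, c].mean() for c in (0, 1)])
suspects = np.where(proba[np.arange(200), y_noisy] < thresholds[y_noisy])[0]
print(f"flagged {len(suspects)} samples as possibly mislabeled")
```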

Paper link: https://doi.org/10.1145/3551349.3560418

46、Multi-objective Optimization-based Bug-fixing Template Mining for Automated Program Repair

Template-based automated program repair (T-APR) techniques rely on the quality of their fix templates. For templates to be of sufficient quality to succeed in T-APR, they must satisfy three criteria: applicability, repairability, and efficiency. Existing template mining methods select templates based only on the first criterion and are therefore insufficiently optimized for performance. This study proposes a multi-objective optimization-based method for mining T-APR fix templates, in which we estimate template quality using nine code abstraction tasks and three objective functions. Our approach identifies the optimal code abstraction strategy (i.e., the best combination of abstraction tasks) that maximizes the three objective functions, and generates the final collection of fix templates. Our preliminary experiments show that, compared with existing mining techniques, our optimization strategy improves the applicability and efficiency of templates by 7% and 146%, respectively. We therefore conclude that multi-objective optimization-based template mining can effectively find high-quality fix templates.

Paper link: https://doi.org/10.1145/3551349.3559554

47、Not All Dependencies are Equal: An Empirical Study on Production Dependencies in NPM

Modern software systems often speed up development by leveraging libraries and packages written by others. While third-party packages bring many benefits, software projects often depend on a large number of them, leaving developers with the difficult challenge of keeping dependencies up to date and free of security vulnerabilities. But how often do project dependencies actually pose a threat to a project's security in production? We conduct an empirical study of 100 JavaScript projects using the Node Package Manager (npm) to quantify how often project dependencies are deployed to production, and to analyze their characteristics and their impact on security. Our results show that less than 1% of installed dependencies are deployed to production. Our analysis reveals that a package's functionality alone is not enough to determine whether it will end up in production: in fact, 59% of runtime dependencies are not used in production, while 28.2% of development dependencies are, overturning two common assumptions of dependency management. The findings also show that most security alerts target dependencies not used in production, making them highly unlikely to pose a risk to software security. Our study reveals a more nuanced side of dependency management: not all dependencies are equal. Dependencies used in production are more sensitive to security exposure and should be prioritized, but current tools lack proper support for identifying production dependencies.

Paper link: https://doi.org/10.1145/3551349.3556896

48、Precise (Un)Affected Version Analysis for Web Vulnerabilities

Web applications have become attractive attack targets due to their popularity and large numbers of vulnerabilities. To mitigate the threat of web vulnerabilities, an important piece of information is the set of versions they affect. However, constructing accurate affected-version information is non-trivial, because confirming whether a version is affected requires security expertise and considerable effort, often involving the inspection of hundreds of versions. As a result, this information is maintained at low quality in almost every public vulnerability database. It would therefore be useful to have a tool that automatically and accurately checks whether most, if not all, software versions are affected. To this end, this paper proposes a vulnerability-centric approach for precise analysis of the (un)affected versions of web vulnerabilities. The key idea is to extract the vulnerability logic from the patch and use that logic directly to check whether a version is affected. Compared with existing approaches, the vulnerability-centric approach tolerates code changes between different software versions. We construct a high-quality dataset containing 34 CVEs and 299 software versions to evaluate our method. Results show that it achieves 98.15% precision and 85.01% recall in identifying (un)affected versions, significantly outperforming existing tools such as V-SZZ, ReDebug, and V0Finder.
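
The vulnerability-centric idea, reduced to its simplest form, is to extract the vulnerable logic from the patch and test each release for its presence, rather than diffing whole versions. The snippet below matches raw text purely for illustration; the paper's analysis works on program semantics, and the code and version data here are invented.

```python
# Toy version of the vulnerability-centric check: a release is "affected"
# if it still contains the logic that the security patch removed.
vulnerable_logic = "strcpy(dest, user_input)"   # taken from the patch's deleted code

releases = {
    "1.0": "void f(){ strcpy(dest, user_input); }",
    "1.1": "void f(){ strncpy(dest, user_input, n); }",  # patched
}
affected = [v for v, src in releases.items() if vulnerable_logic in src]
print("affected versions:", affected)
```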

Paper link: https://doi.org/10.1145/3551349.3556933

49、Privacy Analysis of Period Tracking Mobile Apps in the Post-Roe v. Wade Era

Period tracking apps have become extremely popular in recent years as tools for managing personal health. However, on June 24, 2022, the U.S. Supreme Court overturned Roe v. Wade, and abortion is now banned in a growing number of states. Because the health data stored in period tracking apps can be used to infer whether a user has had, or is considering, an abortion, mobile users worry that these apps could expose their sensitive information and be used to prosecute them. Although period tracking apps have received attention from the research community, no existing study has conducted a systematic privacy analysis of them, especially in the wake of Roe v. Wade. To fill this gap, this paper presents a comprehensive privacy analysis of popular period tracking apps. We began by collecting 35 popular period tracking apps from Google Play. We then used traffic analysis and static analysis to examine the sensitive user data these apps collect, reviewed their privacy policies and checked the consistency of each policy with the app's actual behavior, and analyzed app reviews to understand users' concerns. Our study reveals that some period tracking apps do collect sensitive information and potentially share it with third parties. These apps urgently need measures to protect user privacy, and mobile users should pay close attention to the apps they use.

Paper link: https://doi.org/10.1145/3551349.3561343

50、Property-Based Automated Repair of DeFi Protocols

Programming errors can lead to security attacks on smart contracts, which manage large financial assets. Automated program repair (APR) techniques aim to relieve developers of the burden of manually fixing bugs by automatically generating patches for a given issue. Existing smart contract APR tools focus on mitigating typical smart contract vulnerabilities rather than violations of functional specifications. However, in decentralized finance (DeFi) smart contracts, inconsistencies between the intended behavior and the implementation cause deviations from the underlying financial model, resulting in monetary losses for the application and its users. In this work, we propose DeFinery, a technique for automated repair of smart contracts that do not satisfy user-defined correctness properties. To explore a larger and more diverse set of patches while providing formal correctness guarantees with respect to the intended behavior, we combine search-based patch generation with semantic analysis of the original program to infer its specification. Our experiments on repairing nine real-world and benchmark smart contracts show that DeFinery efficiently generates high-quality patches that other existing tools cannot find.

Paper link: https://doi.org/10.1145/3551349.3559560

51、Reentrancy Vulnerability Detection and Localization: A Deep Learning Based Two-phase Approach

Smart contracts are widely and rapidly being used alongside blockchains to automate financial and business transactions, helping people reach agreements while minimizing trust. As the number of smart contracts deployed on blockchains keeps increasing, various bugs and vulnerabilities in them have emerged. With the rapid development of deep learning, many recent studies apply it to vulnerability detection so that security checks can be performed before contracts are deployed. These methods are effective at detecting whether a smart contract is vulnerable, but their results in locating the suspicious statements responsible for a detected vulnerability remain unsatisfactory. To solve this problem, we propose ReVulDL: Reentrancy Vulnerability Detection and Localization, a deep-learning-based two-phase smart contract debugger for one of the most severe vulnerability classes, reentrancy vulnerabilities. ReVulDL integrates vulnerability detection and localization into a unified debugging pipeline. In the detection phase, given a smart contract, ReVulDL uses a graph-based pre-trained model to learn the complex relationships in propagation chains and detect whether the contract contains a reentrancy vulnerability. In the localization phase, if a reentrancy vulnerability is detected, ReVulDL leverages interpretable machine learning to locate the suspicious statements in the contract and provide an explanation for the detected vulnerability. A large-scale empirical study on 47,398 smart contracts shows that ReVulDL achieves promising results both in detecting reentrancy vulnerabilities (e.g., outperforming 16 state-of-the-art vulnerability detection approaches) and in locating vulnerable statements (e.g., 70.38% of vulnerable statements are ranked within the top 10).

Paper link: https://doi.org/10.1145/3551349.3560428

52、Right to Know, Right to Refuse: Towards UI Perception-Based Automated Fine-Grained Permission Controls for Android Apps

Users have the right to know how permissions are used within an Android app, and to deny permissions granted for activities beyond their intended purpose, which may constitute malicious behavior. This paper presents a method, and a vision, for automatically modeling the permissions an Android app requires from the user's perspective and enabling fine-grained, user-driven permission control: helping users make more comprehensive and flexible permission decisions for different app functions, improving app security and data privacy, and pressuring apps to curb permission abuse. Our proposed method has two main stages. First, program analysis techniques look for differences between the user-perceived permission usage and the permissions the app actually uses. Second, machine learning-based prediction captures those differences in permission usage and alerts users so they can act to prevent data leakage. We evaluate an initial implementation of our approach and achieve promising fine-grained permission control accuracy. Beyond the benefits for user privacy, we expect that wider adoption of this approach will also encourage more privacy-aware designs from responsible parties such as app developers, governments, and enterprises.

Paper link: https://doi.org/10.1145/3551349.3559556

53、SML4ADS: An Open DSML for Autonomous Driving Scenario Representation and Generation

Automated driving systems (ADS) require extensive safety assessment before entering the market. However, relying solely on field testing is rarely feasible, since the distances needed to demonstrate adequate safety cannot realistically be covered, so the focus has shifted to scenario-based testing. The challenge is to generate scenarios flexibly. We propose the Scenario Modeling Language for ADS (SML4ADS), a domain-specific modeling language (DSML) for scenario representation and generation. Compared with existing work, our approach simplifies scenario description in a non-programmatic, user-friendly way, allows the stochastic behavior of vehicles to be modeled, and generates executable scenarios in CARLA. We have applied SML4ADS to many typical scenarios, providing preliminary evidence of the effectiveness and feasibility of our method for modeling and generating executable scenarios.

Paper link: https://doi.org/10.1145/3551349.3561169

54、Scrutinizing Privacy Policy Compliance of Virtual Personal Assistant Apps

Among virtual personal assistant services such as Amazon Alexa, feature-rich and easy-to-access applications have become extremely popular. Virtual personal assistant applications (VPA apps for short) come with privacy policy documents that inform users of their data handling practices. These documents are often too long and complex for users, and developers may, intentionally or not, fail to comply with them. In this work, we conduct the first systematic study of privacy policy compliance issues in VPA apps. We developed Skipper for Amazon Alexa skills: it analyzes privacy policy documents with natural language processing (NLP) and machine learning techniques to automatically derive a skill's declared privacy profile, and it derives the skill's behavioral privacy profile through black-box testing. We conducted a large-scale analysis of all skills listed in the Alexa store and found that a large number of skills violate their stated privacy policies.

Paper link: https://doi.org/10.1145/3551349.3560416

55、SelfAPR: Self-supervised Program Repair with Test Execution Diagnostics

A series of recent papers has shown that learning-based program repair achieves promising results. However, we observed that related work fails to fix some bugs because it lacks knowledge of the application domain of the program being repaired and of the fault type involved. In this paper, we address both issues by shifting the learning paradigm from supervised training to self-supervised training, in an approach called SelfAPR. First, SelfAPR generates training samples on disk by perturbing a previous version of the program being repaired, forcing the neural model to capture project-specific knowledge; this differs from previous work based on mining past commits. Second, SelfAPR executes every training sample and extracts and encodes the test execution diagnostics into the input representation, guiding the neural model toward the fault type; this differs from existing studies that consider only static source code as input. We implemented SelfAPR and evaluated it systematically: we generated 1,039,873 training samples by perturbing 17 open source projects, and evaluated SelfAPR on 818 bugs from Defects4J. SelfAPR correctly fixed 110 of them, outperforming all supervised-learning-based repair approaches.
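
A minimal sketch of the self-supervised sample generation: perturb a line of a past project version to fabricate (buggy, fixed) pairs, and attach a test-execution diagnostic to each input. The two perturbation rules and the diagnostic tags below are invented stand-ins for SelfAPR's actual perturbation model.

```python
# Illustrative self-supervised sample generation: each perturbation of a
# correct line yields a (diagnostic + buggy) input whose target is the
# original line. Rules and tags are assumptions, not SelfAPR's own set.
import re

def perturb(line):
    """Yield (perturbed_line, diagnostic_tag) pairs for one source line."""
    if "<" in line:  # boundary-condition mutation
        yield line.replace("<", "<=", 1), "[TEST FAILURE] boundary check"
    for m in re.finditer(r"\b(\w+)\(", line):  # break a callee name
        yield line.replace(m.group(1), m.group(1) + "X", 1), "[COMPILE ERROR] unknown symbol"

original = "if (i < buf.length) process(buf[i]);"
for buggy, diag in perturb(original):
    print(f"{diag:38} {buggy}  ->  {original}")
```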

Paper link: https://doi.org/10.1145/3551349.3556926

56、Shibboleth: Hybrid Patch Correctness Assessment in Automated Program Repair

Automated program repair (APR) systems based on generate-and-validate approaches produce many patches that pass the test suite without fixing the bug. Such patches must be reviewed manually by developers, a time-consuming task that diminishes APR's value in reducing debugging costs. We present the design and implementation of Shibboleth, a novel tool for automatically assessing the patches generated by generate-and-validate APR systems. Shibboleth ranks and classifies patches using lightweight static and dynamic heuristics drawn from both test and production code. The underlying intuition is that a buggy program is almost correct: bugs are small errors that need only minor changes to fix, and fixing them should not remove the code that implements the program's correct functionality. The tool therefore separates out patches that lead to similar programs and do not delete required program elements, by measuring their impact on production code (via syntactic and semantic similarity) and on test code (via code coverage). We evaluated Shibboleth on 1,871 patches generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques: in our ranking dataset, Shibboleth places the correct patch in the top-1 or top-2 position 66% of the time, and in classification mode it achieves an accuracy of 0.887 and an F1 score of 0.852. A demonstration video of the tool is available at https://bit.ly/3NvYJN8.
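
The ranking intuition, that a plausible patch changes production code minimally and does not delete exercised code, can be sketched as a scoring function; the weights and inputs below are illustrative assumptions, not Shibboleth's actual heuristics.

```python
# Toy patch ranking: reward syntactic similarity to the buggy code and
# penalize patches that delete code covered by passing tests.
import difflib

def patch_score(buggy_src, patched_src, removes_covered_code):
    sim = difflib.SequenceMatcher(None, buggy_src, patched_src).ratio()
    penalty = 0.5 if removes_covered_code else 0.0  # deleting exercised code is suspicious
    return sim - penalty

buggy = "int mid = (lo + hi) / 2;"
candidates = {
    "int mid = lo + (hi - lo) / 2;": False,  # small, behavior-preserving change
    "return 0;": True,                       # deletes required functionality
}
ranked = sorted(candidates, key=lambda p: -patch_score(buggy, p, candidates[p]))
print(ranked)
```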

Paper link: https://doi.org/10.1145/3551349.3559519

57、Simulating cyber security management: A gamified approach to executive decision making

Not all executive managers have the cybersecurity expertise needed to make business decisions that truly reflect their organization's cybersecurity posture and needs. Unfortunately, this gap in understanding between the business and cybersecurity domains produces structural vulnerabilities in the business environment: either the cybersecurity impact is not understood when weighing business needs, or cybersecurity needs are not fully understood in terms of their impact on business strategy and financial stability. To address this dilemma, we propose delivering cybersecurity training to executives in a gamified manner, aiming not only to minimize cyber vulnerabilities in the business environment but also to improve the business outcomes that cybersecurity measures support. We developed a serious-gaming software platform called Aurelius that simulates the role of an executive decision maker managing day-to-day cybersecurity investment decisions, combined with business metrics, to blend business and cybersecurity understanding. The game includes simulated cybersecurity attacks that require an appropriate response from the executive decision maker (the player). Our algorithm for simulating cybersecurity games takes a complex-systems approach, as this most accurately reproduces an executive's experience. In our design, Aurelius meets eight of the nine criteria for advanced serious gaming in cybersecurity.

Paper link: https://doi.org/10.1145/3551349.3561148

58、So Many Fuzzers, So Little Time: Experience from Evaluating Fuzzers on the Contiki-NG Network (Hay)Stack

Fuzz testing ("fuzzing") is a widely used and effective dynamic technique for discovering crashes and security vulnerabilities in software. Many tools support fuzzing, and they continue to improve in both detection capability and execution speed. This paper reports our experience applying state-of-the-art mutation-based and hybrid fuzzing tools (AFL, Angora, Honggfuzz, Intriguer, MOpt-AFL, QSym, and SymCC) to a very complex codebase, Contiki-NG, over a period of more than three years, revealing and fixing critical vulnerabilities at every layer of the software's network stack. As a spin-off, we provide a Git-based platform that allowed us to create and apply a fairly challenging open source bug suite for evaluating fuzzers on real-world software vulnerabilities. Using this bug suite, we present an unbiased and comprehensive assessment of the effectiveness of these fuzzing tools and measure the impact of sanitizers on them. Finally, we share our experience and views on how fuzzers should be used and evaluated in the future.

Paper link: https://doi.org/10.1145/3551349.3556946

59、StandUp4NPR: Standardizing SetUp for Empirically Comparing Neural Program Repair Systems

Recently, a new trend in automatic program repair is to apply deep neural networks that generate fixed code from buggy code, known as NPR (Neural Program Repair). However, existing NPR systems use very different settings for training and evaluation (e.g., different training data, inconsistent evaluation data, widely varying candidate budgets), which makes it difficult to draw unbiased conclusions when comparing them. For this reason, we first built a standard benchmark dataset and an extensive framework tool to mitigate threats to comparison. The dataset comprises a training set, a validation set, and an evaluation set with 144,641, 13,739, and 13,706 pairs of Java bug-fix samples, respectively. The tool supports selecting specific training, validation, and evaluation data and automates the training and evaluation of NPR models, while new NPR models can be integrated easily by implementing a well-defined interface. Then, based on the benchmark dataset and tool, we conduct a comprehensive empirical comparison of six state-of-the-art NPR systems in terms of repair capability, repair propensity, and generalization ability. The experimental results reveal deeper characteristics of the compared NPR systems and overturn some existing comparative conclusions, further validating the necessity of unified experimental settings for tracking the progress of NPR systems. We also identify some common characteristics of NPR systems (e.g., they are good at handling code-deletion bugs). Finally, we point out several promising research directions based on our findings.
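
The "well-defined interface" idea amounts to making every NPR system implement the same few entry points so that one pipeline can train and evaluate them under identical data and candidate budgets. A minimal sketch follows; the interface shape is an assumption for illustration, not the tool's real API.

```python
# Hypothetical pluggable NPR interface: any system implementing train()
# and repair() runs through one shared pipeline under identical settings.
from abc import ABC, abstractmethod

class NPRModel(ABC):
    @abstractmethod
    def train(self, pairs):                 # pairs: [(buggy, fixed), ...]
        ...
    @abstractmethod
    def repair(self, buggy, n_candidates):  # -> list of candidate fixes
        ...

class EchoBaseline(NPRModel):               # trivial placeholder system
    def train(self, pairs):
        self.memory = dict(pairs)
    def repair(self, buggy, n_candidates=5):
        return [self.memory.get(buggy, buggy)] * n_candidates

def evaluate(model, train_set, eval_set, n_candidates=5):
    model.train(train_set)
    hits = sum(fixed in model.repair(buggy, n_candidates)
               for buggy, fixed in eval_set)
    return hits / len(eval_set)

data = [("return a - b;", "return a + b;")]
print(evaluate(EchoBaseline(), data, data))  # 1.0 on its own training pair
```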

Paper link: https://doi.org/10.1145/3551349.3556943

60、SymFusion: Hybrid Instrumentation for Concolic Execution

Concolic execution, the dynamic variant of symbolic execution, is an execution style designed with scalability in mind. Recent concolic executors rely heavily on program instrumentation to achieve it. Instrumentation code can be added at compile time (e.g., using an LLVM plugin) or directly at execution time with the help of a dynamic binary translator. The former approach produces more efficient code but requires recompilation, which unfortunately is not always possible or practical (e.g., in the presence of third-party components). The latter approach does not require recompilation but incurs significantly higher execution overhead. In this paper, we investigate a hybrid instrumentation approach for concolic execution called SymFusion. This hybrid approach lets the user recompile the core components of an application, minimizing their instrumentation overhead, while still being able to instrument the remaining components dynamically at execution time. Our experimental evaluation shows that this design achieves a good balance between efficiency and effectiveness on several real-world applications.

Paper link: https://doi.org/10.1145/3551349.3556928

61、ThirdEye: Attention Maps for Safe Autonomous Driving Systems

Automatically identifying unexpected situations online is an essential component of keeping autonomous vehicles safe in unknown and uncertain conditions. This paper proposes a runtime monitoring technique based on the attention maps computed by explainable artificial intelligence techniques. Our approach, implemented in a tool called ThirdEye, turns attention maps into confidence scores that distinguish safe from unsafe driving behavior. The underlying idea is that uncommon attention maps are associated with unexpected runtime conditions. In our empirical study, we evaluated the effectiveness of different configurations of ThirdEye at predicting simulated injected failures, covering both unknown conditions (such as severe weather and lighting) and unsafe/uncertain conditions created through mutation testing. The results show that, overall, ThirdEye could predict 98% of misbehaviors up to three seconds in advance, outperforming a state-of-the-art failure predictor for autonomous vehicles.
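
One simple way to turn attention maps into a confidence score is distance from the maps observed during nominal driving, raising an alarm when the distance crosses a threshold. The sketch below uses synthetic maps and a made-up threshold; ThirdEye's actual scoring is more sophisticated than this.

```python
# Toy attention-map monitor: score a new map by its distance from the
# average map seen in safe runs; flag it when the distance is unusual.
import numpy as np

rng = np.random.default_rng(1)
nominal = rng.random((100, 8, 8))            # attention maps from safe runs
mean_map = nominal.mean(axis=0)

def confidence(att_map, threshold=4.0):      # threshold is an assumption
    dist = np.linalg.norm(att_map - mean_map)
    return dist, dist > threshold            # (score, unsafe?)

usual = rng.random((8, 8))                   # looks like nominal maps
odd = np.zeros((8, 8)); odd[0, 0] = 9.0      # mass on one unusual region
print(confidence(usual), confidence(odd))
```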

Paper link: https://doi.org/10.1145/3551349.3556968

62、Towards Effective Static Analysis Approaches for Security Vulnerabilities in Smart Contracts

The popularity of smart contracts has brought a rise in security attacks against them, causing millions of dollars in financial losses and eroding trust. To help developers find vulnerabilities in smart contracts, a variety of static analysis tools have been proposed. Yet despite the many bug-finding tools, security vulnerabilities in smart contracts remain widespread, and developers still rely on finding them manually. The goal of this paper is to broaden the scope of security vulnerability detection by proposing effective static analysis approaches. We study the effectiveness of existing static analysis tools and propose a solution for detecting security vulnerabilities based on analyzing how contract code depends on user input. Our evaluation shows that existing static tools for smart contracts suffer from significant false negatives and false positives, and that our first vulnerability detection method achieves a significant improvement in detection effectiveness over previous work.

Paper link: https://doi.org/10.1145/3551349.3559567

63、Towards Improving the Adoption and Usage of National Digital Identity Systems

Users' perceptions of National Digital Identity Systems (NDIDs) significantly affect their usage and acceptance. Previous research on NDID use provides only a limited framework for future work, with its primary emphasis on government services and on how to improve the systems themselves. This study evaluates how human-centered cybersecurity factors influence users' use and acceptance of NDIDs. For example, Australia's My Health Record system was widely rejected by users over concerns about unauthorized use of digital identity information and other privacy issues. We hypothesize that human-centered cybersecurity factors influence users' use and acceptance of NDIDs. The study also has practical implications, as it provides a framework for identifying the human-centered cybersecurity factors that influence NDID adoption and improved use.

Paper link: https://doi.org/10.1145/3551349.3561144

64、TransplantFix: Graph Differencing-based Code Transplantation for Automated Program Repair

Automated program repair (APR) is expected to assist manual debugging activities. After more than a decade of development, a wide variety of APR techniques have been proposed and evaluated on a set of real-world bug datasets. However, although more and more bugs can be fixed correctly, we observe that APR techniques' growth in fixing new bugs has hit a bottleneck in recent years. In this work, we explore the possibility of tackling complicated bugs by proposing TransplantFix, a novel APR technique that uses graph-differencing-based code transplantation. The key novelty of TransplantFix lies in three aspects: 1) we propose a graph-based differencing algorithm to extract semantic fix actions from donor methods; 2) we design an inheritance-hierarchy-aware code search method to identify donor methods with similar functionality; 3) we propose a namespace transfer approach to adapt donor code effectively. We investigate TransplantFix's distinct contribution through an extensive comparison, evaluating it on Defects4J v1.2 and v2.0. TransplantFix delivers superior results in three respects. First, it achieves the best performance in the number of newly fixed bugs, a 60%-300% improvement over state-of-the-art APR techniques proposed in the past three years. Furthermore, without relying on any hand-crafted repair operations or repair operations learned from big data, it shows the best generalization ability among all APR techniques evaluated on Defects4J v1.2 and v2.0. Moreover, it demonstrates the potential to synthesize complicated patches consisting of up to eight lines of inserted code. TransplantFix offers new insights and a promising research direction for fixing more complicated bugs.

Paper link: https://doi.org/10.1145/3551349.3556893

65、TreeCen: Building Tree Graph for Scalable Semantic Code Clone Detection

Code clone detection is an important research problem that has attracted wide attention in software engineering. Many methods have been proposed for detecting code clones; among them, text-based and token-based methods are highly scalable but ignore code semantics and therefore fail to detect semantic code clones. Methods based on intermediate representations of code can tackle semantic clone detection, but graph-based methods are often impractical because they require code compilation, and existing tree-based methods are limited by tree size and do not scale. In this paper, we propose TreeCen, a scalable tree-based code clone detector that remains scalable while effectively detecting semantic clones. Given the source code of a method, we first extract its abstract syntax tree (AST) by static analysis and, rather than adopting traditional heavyweight tree matching, convert it into a simple graph representation (a tree graph) based on node types. We then treat the tree graph as a social network and perform centrality analysis on each node to preserve the details of the tree. Through this processing, the original complex tree is transformed into a 72-dimensional vector that still contains comprehensive structural information about the AST. Finally, these vectors are fed into a machine learning model to train a detector for spotting code clones. We evaluate TreeCen's detection performance and scalability. Experimental results show that TreeCen achieves F1 scores of 0.99 and 0.95 on the BigCloneBench and Google Code Jam datasets respectively, the best performance among the six state-of-the-art baselines (SourcererCC, RtvNN, DeepSim, SCDetector, Deckard, and ASTNN). In terms of scalability, TreeCen is about 79 times faster than the state-of-the-art tree-based semantic clone detector (ASTNN), about 13 times faster than the fastest graph-based method (SCDetector), and even about 22 times faster than the token-based detector (RtvNN).
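
The AST-to-vector pipeline can be sketched with Python's own ast module: parse a method, collapse the AST into a node-type graph, and read off a centrality value per node type. TreeCen's node-type vocabulary (which yields 72 dimensions) and its centrality measures differ; this shows only the shape of the idea.

```python
# Illustrative pipeline: AST -> node-type "tree graph" -> centrality vector.
# Out-degree centrality is a simplification of TreeCen's actual measures.
import ast
from collections import defaultdict

code = "def f(a, b):\n    if a > b:\n        return a\n    return b"
tree = ast.parse(code)

# tree graph: one node per AST node type, one edge per parent-child link
adjacency = defaultdict(set)
for parent in ast.walk(tree):
    for child in ast.iter_child_nodes(parent):
        adjacency[type(parent).__name__].add(type(child).__name__)

node_types = sorted(set(adjacency) | {t for ts in adjacency.values() for t in ts})
n = len(node_types)
vector = [len(adjacency[t]) / (n - 1) for t in node_types]  # normalized out-degree
print(dict(zip(node_types, [round(v, 2) for v in vector])))
```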

Paper link: https://doi.org/10.1145/3551349.3556927

66、V-Achilles: An Interactive Visualization of Transitive Security Vulnerabilities

An important threat in third-party dependencies is security vulnerabilities, which may allow illicit access to user applications. As part of a dependency ecosystem, users of a library are susceptible to vulnerabilities in both the direct and transitive dependencies their applications employ. Recent work offers tools that support updating vulnerable dependencies, but few convey the complexity of transitive updates. In this paper, we introduce our solution for supporting vulnerability updates in npm. V-Achilles is a prototype that visualizes (using a dependency graph) the dependencies affected by vulnerability attacks. Besides an overview of the tool, we highlight three use cases demonstrating the utility and application of our prototype on real npm packages. The prototype is available at https://github.com/MUICT-SERU/V-Achilles, with an accompanying video demonstration at https://www.youtube.com/watch?v=tspiZfhMNcs.

Paper link: https://doi.org/10.1145/3551349.3559526

67、Xscope: Hunting for Cross-Chain Bridge Attacks

Cross-chain bridges have become the most popular solution for interoperability between heterogeneous blockchain assets. However, while enabling efficient and flexible cross-chain asset transfer, the complex workflow involving both on-chain smart contracts and off-chain programs raises emerging security issues. Over the past year, cross-chain bridges have suffered more than a dozen severe attacks, with losses in the billions of dollars. Because research on cross-chain bridge security is scarce, the community still lacks the knowledge and tools to handle this significant threat. To fill the gap, we conducted the first study on the security of cross-chain bridges. We document three new classes of security bugs and propose a set of security properties and patterns that characterize them. Based on these patterns, we designed Xscope, an automated tool for finding security violations in cross-chain bridges and detecting real-world attacks. We evaluated Xscope on four popular cross-chain bridges: it successfully detected all known attacks and found suspicious new attacks that had not previously been reported. A video of Xscope is available at https://youtu.be/vMRO_qOqtXY.
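
One plausible invariant of the kind the paper formalizes, that every withdrawal on the target chain should correspond to a matching deposit on the source chain, can be expressed as a simple consistency check. The event fields below are invented for illustration and are not Xscope's actual patterns.

```python
# Toy cross-chain consistency check: flag withdrawals on the target chain
# that have no matching deposit (by transaction and amount) on the source chain.
deposits    = [{"tx": "0xa", "amount": 100}, {"tx": "0xb", "amount": 5}]
withdrawals = [{"deposit_tx": "0xa", "amount": 100},
               {"deposit_tx": "0xc", "amount": 50}]   # no matching deposit

by_tx = {d["tx"]: d["amount"] for d in deposits}
for w in withdrawals:
    if by_tx.get(w["deposit_tx"]) != w["amount"]:
        print("violation: unmatched withdrawal", w)
```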

Paper link: https://doi.org/10.1145/3551349.3559520

68、‘Who built this crap?’ Developing a Software Engineering Domain Specific Toxicity Detector

In open source software (OSS) projects, toxic interactions among developers harm working relationships, so a toxicity detector targeting the software engineering (SE) domain is needed. However, previous studies found that existing toxicity detection tools perform poorly on SE texts. To address this challenge, I developed ToxiCR, a toxicity detector specifically for the SE domain, and evaluated it on 19,571 manually labeled code review comments. I evaluated ToxiCR with ten supervised learning models, five text vectorization methods, and eight preprocessing techniques (two of them SE domain-specific). After trying all possible combinations, I found that ToxiCR achieves an accuracy of 95.8% and an F1 score of 88.9%, significantly outperforming existing toxicity classifiers.

Paper link: https://doi.org/10.1145/3551349.3559508
