FuzzerGym: A Competitive Framework for Fuzzing and Learning

Abstract

Fuzzing is a commonly used technique for testing software by automatically generating program inputs. Currently, the most successful fuzzing algorithms emphasize simple, low-overhead strategies paired with the ability to efficiently monitor program state during execution. Through compile-time instrumentation, these fuzzers gain access to many aspects of program state, including coverage, data flow, and heterogeneous fault detection and classification. However, existing methods rely on blind random mutation strategies when generating test inputs. We propose a different approach that uses this state information to optimize mutation operator selection using reinforcement learning (RL). By integrating OpenAI Gym with libFuzzer, we are able to combine advances in reinforcement learning and fuzzing to achieve deeper coverage across multiple benchmarks. Our technique connects the rich and efficient program monitors provided by the LLVM Sanitizers with a deep neural network to learn mutation selection strategies directly from the input data. The cross-language, asynchronous architecture we developed allows us to apply any OpenAI Gym-compatible deep reinforcement learning algorithm to any fuzzing problem with minimal slowdown.

I. Introduction

Fuzz testing is a widely used technique that automatically tests software programs by repeatedly executing them on generated inputs. These inputs are designed to find vulnerabilities in the target program by exercising as many code paths as possible under different conditions. The method used to generate these sets of program inputs (called input corpora) varies due to the trade-off between analysis effort and test execution speed. Black-box fuzzing techniques [10], [20], [26] use a minimal amount of program information to generate inputs quickly, with the result that they reach the shallowest code coverage. White-box fuzzing [12] analyzes the source code in detail using constraint solvers, revealing deep paths through program branches. Although white-box fuzzers can achieve high code coverage, each generated input incurs the greatest computational cost. Greybox techniques strike a balance between these two extremes, emphasizing simple input generation strategies combined with efficient, low-overhead program state monitoring [40], [43]. Greybox fuzzers such as libFuzzer [24] and AFL [47] instrument programs at compile time so they can efficiently monitor progress during testing, allowing the amount of coverage achieved so far to dynamically guide future input generation. These Coverage-based Greybox Fuzzers (CGF) maintain a corpus of effective test inputs that are repeatedly and randomly modified using a simple set of data-manipulation operations called mutators before being executed. Each time a mutated input covers a new code segment, it is added to the corpus so that it can be selected again for further mutation. By dynamically building a corpus in this way, a CGF gradually accumulates a repository of useful inputs that guide deeper exploration. In practice, collaborative learning between fuzzing researchers occurs indirectly through the sharing of these refined input corpora, so that future runs can begin with inputs that are well suited to a given program.

We observe that the CGF fuzzing loop is a good framework in which to integrate automated learning [32], because tests usually execute very quickly (as fast as the function under test) and compiler instrumentation provides rich program state [40], [43]. Although an average 24-hour fuzzing run may perform billions of mutations, few studies have attempted to improve this process, and the current state of the art is uniform random selection. While the combination of uniform randomness and many executions may eventually produce good results, common practice is to restart runs periodically because they become stuck in local coverage minima [24]. We propose applying reinforcement learning to the fuzzing process to select mutation operators (actions) more intelligently and thereby achieve deeper code coverage (higher reward) in less time. By improving mutation operator selection, our method can more effectively drive the local search from existing seeds toward interesting new test inputs.

Although the ubiquitous "fuzzing loop" (Figure 1) bears a striking similarity to the state-action-reward sequence found in reinforcement learning (RL) [6], [36], there are significant obstacles that challenge its direct application to fuzzer mutator selection.

Due to the huge state space of possible program inputs, manually engineering features for this learning process is not feasible, and such features would likely be difficult to generalize to new programs. We demonstrate that, like other aspects of fuzzing, mutation operator selection can benefit from automated feature engineering through deep learning techniques [32]. By learning a deep representation of the input data, we can produce a model that not only achieves greater coverage than libFuzzer on the program it was trained on, but also on a wide range of other programs without retraining.

Despite the progress made in deep reinforcement learning, there remains a huge gap between the evaluation time of a deep RL network (milliseconds) and the execution time of the potentially simple function under test (microseconds). To overcome this problem, we designed an asynchronous architecture for fuzz testing in which reinforcement learning treats the problem like a real-time Atari game (Atari being a video game console from the late 1970s whose simple games are commonly used to benchmark RL algorithms [27]) rather than a discrete sequential decision problem. By integrating our architecture with the OpenAI Gym RL framework and the widely used CGF libFuzzer [24], we immediately gain access to a large library of fuzzing benchmarks [25], tools for large-scale fuzzing [15], [39], and state-of-the-art deep RL algorithms [3], [27].

Summary of our main contributions:

  • We show that, using reinforcement learning, we can train the fuzzer to select mutators effectively without reducing fuzzer throughput.
  • By using the asynchronous frame-skipping technique originally developed for Atari games, our RL agent can overcome the difference in execution time between the fuzzer (microseconds) and the deep neural network (milliseconds) [3].
  • Our method achieves superior average line coverage on most benchmarks (4 of 5).
  • We show that even when the baseline [24] is given more executions, our RL fuzzer penetrates more deeply into several different types of software.

II. Background and Motivation

Despite their success, greybox mutational fuzzers often require a very large number of test invocations (and mutations) to achieve relatively deep code coverage (millions, even for simple programs). This is mainly because their success depends heavily on the ability of randomly generated inputs to pass the various conditional checks that guard deeper code paths. One solution is to deploy many fuzzing runs simultaneously across many different machines. Large-scale fuzzing deployments at several large software companies have found more than 2,000 bugs across more than 60 large projects. This process runs continuously [15], [39], and the resulting corpora are shared among researchers to collectively improve future runs.

This focus on corpus generation and selection also characterizes most current research. Skyfire [45] uses probabilistic context-sensitive grammars to construct structured corpus inputs more efficiently. Vuzzer [33] uses program state information to dynamically select inputs that exercise promising code paths, while Learn & Fuzz [13] uses statistical learning and neural networks to learn how to construct structured inputs from a set of example test cases. Recently, these approaches to structured input generation and filtering have also been extended with deep learning techniques [13], [29], [32].

Although incrementally building a corpus for each program under test is useful, continuous fuzzing is still required to keep up with changes in actively developed software. While less intuitive than the default practice, mutating unstructured data (starting from an empty corpus) has the advantage of being unbiased with respect to the type of data passed to the function under test. In addition, a program that has never been tested before also starts from this empty-corpus state.

Since mutators depend on the input they are modifying, mutator choice can benefit from knowledge of the structure of the input being operated on. In particular, we show that making informed decisions about mutator selection based on the structure of the input data results in greater code coverage in a shorter period of time.

A. libFuzzer Mutation Operators

The fuzzing loop used by libFuzzer (Figure 1) is essentially the same as that used by other mutational fuzzers. Execution starts with a seed input, a mutator (an action in our RL context) is selected and applied, and the resulting new input is tested. This new input may produce a rewarding outcome (new coverage) or not (no new coverage). If the new input increases coverage, it is added to the input corpus so that it can be selected again in the future; otherwise it is discarded.

The principle behind coverage-guided greybox fuzzing is that rather than spending significant computation to understand the complex sequence of transfer functions that defines a program, it is better to rapidly try different inputs that are similar to known inputs that previously increased coverage. By repeating these random perturbations many times, a mutational fuzzer will at some point produce an input that increases coverage, adding a valuable new input to the corpus that can be selected for mutation in the future (Figure 1). As this process (called the fuzzing loop) repeats over time, a corpus emerges that covers a large number of lines of code without any understanding of the format expected by the software under test.
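The following minimal Python sketch illustrates the coverage-guided fuzzing loop described above. It is a simplification for exposition only: `target` (an instrumented function returning the set of coverage points it hit) and the `mutators` list are placeholders for the real instrumented binary and libFuzzer's mutator set.

    import random

    def fuzz_loop(target, mutators, seeds, iterations):
        """Simplified coverage-guided fuzzing loop (illustrative sketch)."""
        corpus = list(seeds) or [b""]          # start from seeds or an empty input
        covered = set()                        # unique coverage points seen so far

        for _ in range(iterations):
            parent = random.choice(corpus)     # pick an existing corpus input
            mutator = random.choice(mutators)  # libFuzzer baseline: uniform random choice
            candidate = mutator(parent)        # apply the mutation
            new_cov = target(candidate)        # run the instrumented target, get coverage set

            if not new_cov.issubset(covered):  # the candidate reached new code
                covered |= new_cov
                corpus.append(candidate)       # keep it for future mutation
        return corpus, covered

The learning approach proposed in this paper replaces only the uniform random mutator choice in this loop; everything else stays the same.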

 

Table 1. Set of libFuzzer mutation operators
Mutator - Description
EraseBytes - Reduce size by deleting random bytes
InsertByte - Increase size by inserting a random byte
InsertRepeatedBytes - Increase size by inserting at least 3 random bytes
ChangeBit - Randomly flip a single bit
ChangeByte - Replace a byte with a random byte
ShuffleBytes - Randomly rearrange the input bytes
ChangeASCIIInteger - Perform random math on an ASCII integer value and overwrite the input
ChangeBinaryInteger - Perform random math on a binary integer value and overwrite the input
CopyPart - Copy part of the input back into itself
CrossOver - Recombine with random parts of the corpus or of the input itself
AddWordPersistAutoDict - Replace part of the input with an entry that previously increased coverage (whole run)
AddWordTempAutoDict - Replace part of the input with an entry that recently increased coverage
AddWordFromTORC - Replace part of the input with a value from a recently executed comparison

The randomness of the mutators (Table 1) is a powerful driver of the local search for useful inputs, but as a consequence, testing usually must be repeated many times to obtain additional coverage. Although the selection of new inputs from the corpus is always influenced by the preceding fuzzing steps, the actual relationship between the initial corpus and its contents later in the test run is difficult to predict. The random nature of the mutators used to generate new inputs makes this input selection process even less predictable. Furthermore, since the most successful fuzzers perform corpus distillation (an online technique that reduces the corpus to a minimal covering set), the distribution of selected inputs can change in unpredictable ways at any time.

Unfortunately, in practice, the corpus is usually the only artifact of a fuzzing run; the testing process itself goes largely unobserved. We intend to extend the focus of fuzz testing from the corpus to the process that creates it. Using our architecture combined with existing tools such as OpenAI Gym and TensorFlow, we can record the test inputs (states) generated during testing and the actions that produced them. By learning mutation operator selection while fuzzing a single program, we can transfer this knowledge to several other, unrelated programs without additional training.

In addition to providing a platform for optimizing mutation operator selection, our RL framework can also be used to evaluate the efficacy of new mutators or groups of mutators. Recent research into more complex and powerful mutators such as FairFuzz [22], or into techniques that discover or generate mutators [29], could benefit from RL-based evaluation, because certain mutators may only become effective in particular combinations or sequences, a finding that is difficult to reveal through hand-constructed mutator sets or detailed parameterization. For example, when additional coverage has not been achieved for a long time, attempting to break a blocking condition (via the AddWordFromTORC operator) may yield the greatest return.

B. Efficient Program State Monitoring with LLVM

A key difference between the three main categories of fuzzers (black-box, grey-box, and white-box) is the degree of program analysis that occurs during testing. Black-box fuzzing ignores program state entirely to maximize test throughput, while white-box techniques invest in heavyweight analysis so that each attempt has greater impact. The contrast between these two approaches highlights the trade-off inherent to all fuzzing techniques: when is the analysis worth the time and effort? Our method shows that, using the state provided by compile-time instrumentation, one can learn how to fuzz more effectively with minimal overhead.

This is possible because libFuzzer is tightly integrated with the Sanitizers provided by the LLVM compiler [21], which is also the main reason we chose to integrate with libFuzzer rather than other tools. As described in Table 2, LLVM provides five different types of runtime instrumentation that can detect and classify program faults and track dynamic code coverage. Although the focus of our initial research was to create a fuzzer that achieves superior line coverage, a major benefit of our framework is the ability to use this rich Sanitizer state to flexibly define RL reward functions. This makes it possible to create fuzzers specialized for particular categories of faults; all that is needed is a reward function that targets the desired outcome.

Table 2. LLVM Sanitizers monitor coverage and provide fault detection and classification
Sanitizer - Description
ThreadSanitizer - Data races, deadlocks
AddressSanitizer - Use-after-free, buffer overflows, memory leaks
MemorySanitizer - Use of uninitialized memory
UndefinedBehaviorSanitizer - Detects undefined behavior
SanitizerCoverage - Code coverage, execution frequency, and caller-callee function relationships

Figure 2. Asynchronous integration of OpenAI Gym and LLVM-libFuzzer

C. Deep reinforcement learning

Reinforcement learning is conceptually simple: an agent situated in an environment (the program under test) observes the current state (via LLVM instrumentation) and uses this information to select the action (mutation operator) expected to achieve the greatest return (maximizing the unique lines of code covered). RL has been used for many years, but recent breakthroughs in deep learning have boosted RL performance in many highly challenging domains [19]. AlphaGo defeated the world champion at the complex game of Go [41], breakthroughs in policy learning taught simulated robots to run [17], and agents learned to play Atari games [4], [16]. These breakthroughs stem from the addition of deep neural networks to the RL toolkit, which can learn features automatically without human guidance. Because many RL algorithms are being developed simultaneously across multiple domains, truly benefiting from research in other fields requires a modular interface between benchmark environments and the RL algorithms used to solve them. Fortunately, the widely used OpenAI Gym framework provides a set of standardized, reusable components for RL.

III. Deep Reinforcement Learning for Fuzzing

By designing an OpenAI Gym environment that casts software testing as a competitive game, we can use state-of-the-art RL techniques to improve libFuzzer's mutation operator selection across the many programs it supports [25]. Although our initial experiments use only a single learning fuzzer, our architecture is designed to support both concurrent fuzzing (ClusterFuzz [39]) and concurrent RL learning (e.g., A3C [27]).

Perhaps the biggest challenge in combining reinforcement learning with fuzz testing is the sheer speed of fuzzing execution (Figure 2). The effectiveness of greybox mutational fuzzers stems largely from their execution speed, which is limited mostly by the time required to execute the function under test. As we found in our experiments, testing can reach execution rates of more than 100,000 per second (10 microseconds per execution). These speeds are not instantaneous peaks: we observed a libpng test run sustaining an average of 112,295 executions per second over 400 seconds. Although the average number of tests per second over a 24-hour fuzzing run is usually much lower (roughly 10,000 executions per second), this still amounts to approximately one billion tests in 24 hours.

Another problem in combining learning with fuzzing is the huge state space encountered under realistic testing conditions. Given that libFuzzer's default maximum input size is 4096 bytes, even a very simple mutator such as InsertByte can produce nearly 256^4096 different results, even without considering changes to the input's structure. More complex operations that depend on the sequence of test invocations (such as AddWordTempAutoDict, AddWordPersistAutoDict, and AddWordFromTORC in Table 1) lead to an even more explosive state-space size that depends on the structure of the software under test and the preceding sequence of fuzzing tests.

We address this complexity by combining a novel asynchronous architecture with reinforcement learning and fuzzing:

  • We define the basic components of the RL problem formulation, including states (Section III.A), rewards (Section III.B), and our efficient method for selecting actions (Section III.C).
  • Drawing inspiration from RL work in the Atari domain, we design an asynchronous architecture that learns how to choose mutators that increase coverage under realistic fuzzing conditions (Section III.C).
  • We describe a modular architecture intended to let software researchers use existing RL implementations with minimal prior knowledge (Section IV).

A. Modeling program state in the context of RL

To formulate a solution to any sequential decision problem, one must determine how to represent a given position in the sequence, that is, how to represent its state in suitable terms. Where possible, it can be advantageous to tailor the state representation to the particular class of programs being tested. For example, in [13] and [45], rich semantic structures are built using the grammars of PDF and XML files, respectively. For our proposed architecture, however, we operate directly on the byte-array representation of the test input, providing a common format that matches both libFuzzer's interface and the input expected by the neural network at the core of our RL framework.

Although it is tempting to encode the data as-is, as the sequence of numbers [0-255] corresponding to each value in the byte array, this byte-level approach has substantial disadvantages: test input data does not necessarily encode numeric values (such as the RGB values in an image) but may also encode bit masks or other non-numeric content. We therefore prefer the array-of-bits technique [32] for representing test-case inputs to our neural network. For all experiments, we fix the maximum input size to libFuzzer's default (4096 bytes), giving an observation of 4096 x 8 = 32,768 bits.

Although our method follows the technique described in [32], we did perform initial experiments using a byte-level representation and found empirically that the bit-array representation produced better results for RL mutator selection.
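The sketch below shows one straightforward way to produce the array-of-bits encoding assumed above. The fixed 4096-byte maximum follows libFuzzer's default; the function name and use of NumPy are illustrative, not part of the original implementation.

    import numpy as np

    MAX_LEN = 4096  # libFuzzer's default maximum input size (bytes)

    def encode_state(test_input: bytes) -> np.ndarray:
        """Encode a test input as a fixed-length bit array for the RL network."""
        buf = np.zeros(MAX_LEN, dtype=np.uint8)
        data = np.frombuffer(test_input[:MAX_LEN], dtype=np.uint8)
        buf[:len(data)] = data
        return np.unpackbits(buf)  # shape (4096 * 8,) = (32768,), values in {0, 1}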

B. Use RL rewards for flexible fuzzing

A key advantage of using RL is the ability to specify a reward function that incentivizes whatever fuzzing behavior is deemed desirable. In our case, the fuzzing "high score" is simply the maximization of unique lines covered:

R_{t} = cov_{t}

where:
R_{t}: reward at time 't'
cov_{t}: number of unique lines covered at time 't'

The flexibility of the RL reward function means that fuzzers can be created for a variety of specific needs, such as finding bugs quickly or rewarding progress toward specific code paths. Thanks to our integration with the LLVM Sanitizers (Table 2), these additional signals can be identified and rewarded without extra instrumentation. This enables not only the discovery of useful testing strategies, but also a broader understanding of which test inputs elicit the desired reward response.
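As a sketch of how such a reward function might be parameterized: the coverage term below matches the R_t = cov_t formulation used in our experiments, while the sanitizer-fault bonus is an illustrative assumption showing how a specialized fuzzer could be configured, not part of our evaluated setup.

    def reward(cov_t: int, faults_t: int = 0, fault_bonus: float = 0.0) -> float:
        """Coverage-based reward with an optional bonus for sanitizer-detected faults.

        cov_t:    unique lines covered at time t (the only term used in our runs)
        faults_t: faults reported by the LLVM Sanitizers at time t (optional signal)
        """
        return float(cov_t) + fault_bonus * faults_t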

C. Efficient Mutator Selection

Although mutational fuzzing involves a large number of actions with varying random outcomes, the most challenging aspect of applying RL to mutator selection is the gap between the time required to evaluate the neural network (milliseconds) and the time required for a single execution of the program under test (microseconds). From an RL perspective, it would be ideal to treat this as a discrete, fully observable problem (a Markov decision process, or MDP), but in our assessment this is impractical because it would require the fuzzer to wait for an RL evaluation before every test invocation.

Since executing any test is at least as likely to reach new coverage as executing no test at all, forcing the fuzzer to sit idle is almost pointless. While it is not difficult to design an RL agent that outperforms random mutation selection on a per-action basis, such an agent is of little use if, given the same amount of testing time, it performs worse than random selection.

We note that this kind of execution-time mismatch has been addressed before in the Atari domain, through the creation of the Arcade Learning Environment (ALE) [3]. By sampling the simulator's video frames at regular intervals, learning can proceed while the game runs at a much faster rate. Frame skipping [3] has the RL agent observe and learn only every k-th frame (rather than every frame), repeating its previous action while waiting for a new frame. This narrows the execution-time gap and has a net positive effect on RL performance. In the fuzzing domain, frame skipping is even more critical to performance, because performing fewer tests than the standard libFuzzer implementation would place the learning fuzzer at a significant disadvantage. As shown in Figure 2, this can be implemented efficiently by maintaining a fixed-size circular buffer of previously selected mutation actions.
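The following sketch shows the kind of fixed-size circular buffer implied by this scheme: the RL agent overwrites the oldest entry whenever a new action is ready, while the fuzzer thread cycles through the buffer without ever blocking. Class and method names are illustrative placeholders, not the actual implementation.

    import itertools
    import threading

    class ActionBuffer:
        """Fixed-size circular buffer of mutator choices shared by fuzzer and RL agent."""

        def __init__(self, size: int, default_action: int = 0):
            self._buf = [default_action] * size
            self._write = 0                       # next slot the agent will overwrite
            self._lock = threading.Lock()

        def push(self, action: int) -> None:
            """RL agent: replace the oldest entry with a newly selected mutator."""
            with self._lock:
                self._buf[self._write] = action
                self._write = (self._write + 1) % len(self._buf)

        def actions(self):
            """Fuzzer: loop over the buffer indefinitely without ever blocking."""
            for i in itertools.count():
                with self._lock:
                    action = self._buf[i % len(self._buf)]
                yield action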

A consequence of the asynchronous nature of our architecture (which allows us to skip frames in fuzzing time) is that we are solving a partially observable Markov decision process (POMDP) rather than the fully observable MDP that a serial, discrete approach would provide. This means that rather than receiving every state input sent to the fuzzer, we learn from periodic observations of the true system state. This partial observability makes our problem formulation similar to that discussed in [16], in which an RL agent plays Atari games on a flickering screen. We mitigate the partial observability using the same method described in [16]: augmenting our network architecture with a long short-term memory (LSTM) layer so that our RL agent can learn temporally dependent actions. We also note the success of LSTMs applied to deep-learning-enhanced fuzzing via seed filtering [32] as further motivation.

Although our asynchronous, partially observable formulation does make the learning process more challenging, we believe the trade-off strongly favors our asynchronous method and more faithfully represents real-world fuzzing as an RL problem. In summary, our method differs from a serial MDP RL formulation in the following ways:

  • Partial observability: our RL agent receives only asynchronous samples of the test inputs when deciding which mutator to choose. Although we cannot observe every state, we show that these snapshots provide an adequate approximation of the state for informing mutator selection.
  • Asynchronous action selection: since the speed of fuzzing far exceeds our RL network's ability to generate actions, we replace the oldest selected action in the circular buffer as quickly as new actions become available. Because libFuzzer can loop over this buffer continuously without blocking, our framework can be used without reducing test throughput.

IV. Deep RL Fuzzing Benchmark Environment

Although hand-designed features have historically dominated, we demonstrate that, like other aspects of fuzzing, mutation operator selection can benefit from automated feature engineering via deep learning techniques [32].

To achieve this, we primarily use the RL framework Tensorforce [36], which provides an easy-to-use, JSON-based configuration layer over many RL algorithms. Since Tensorforce can learn any problem compatible with OpenAI Gym, we can apply any of its existing capabilities to the fuzzing problem. In our experiments, we chose Deep Double-Q learning (Section IV.A), with configuration parameters identical to the Cart-Pole benchmark provided by the Tensorforce authors [35]. Although the complete configuration is available through the referenced link, we highlight the prioritized replay memory (50K capacity) used in our configuration. Using the method originally defined in [38], state-action-reward sequences from the past are replayed to accelerate learning, and these replayed memories are prioritized according to their estimated value to the learning process.


Since a key contribution of our work is a practical system that exceeds the current capabilities of libFuzzer, we use the complete list of 13 mutators (Table 1) as the possible actions for our RL agent. This ensures that our RL agent solves the same problem as the libFuzzer baseline, and it also provides a test bed that reveals, in an emergent way, which mutators are and are not useful for a given program.
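A minimal sketch of the observation and action spaces such an OpenAI Gym environment could expose is shown below. The class name and the placeholder method bodies are assumptions; in the real system, reset and step would communicate with the instrumented libFuzzer process rather than return zeros.

    import gym
    import numpy as np
    from gym import spaces

    NUM_MUTATORS = 13        # full libFuzzer mutator set (Table 1)
    STATE_BITS = 4096 * 8    # bit-array encoding of the test input

    class FuzzerGymEnv(gym.Env):
        """Sketch of a Gym wrapper around an instrumented libFuzzer target."""

        def __init__(self):
            self.action_space = spaces.Discrete(NUM_MUTATORS)
            self.observation_space = spaces.MultiBinary(STATE_BITS)

        def reset(self):
            # Placeholder: restart the fuzzer and return the initial observation.
            return np.zeros(STATE_BITS, dtype=np.int8)

        def step(self, action):
            # Placeholder: push `action` into the shared buffer, then sample the
            # latest test input (observation) and coverage reward from the fuzzer.
            obs = np.zeros(STATE_BITS, dtype=np.int8)
            reward, done, info = 0.0, False, {}
            return obs, reward, done, info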

In this section, we describe in more detail the three basic components required for fuzzing with the OpenAI Gym framework:

  • RL agent: the learning algorithm used to update the network weights to maximize return (Deep Double-Q learning) (Section IV.A)
  • RL network: the underlying neural network structure used to determine which action to choose for a given state (LSTM-based, with 64 units) (Section IV.B)
  • Benchmark environment: the problem specification, a combination of the fuzzer configuration and the program-under-test configuration. These are kept under strict version control to ensure that small changes to the problem definition are trackable and repeatable. (Section IV.C)

A. Deep Double-Q Learning RL Agent for Mutator Selection

The RL algorithm we use is based on Q-learning, meaning that it attempts to predict, without an explicit model, the utility of each action: in our case, which mutator is most likely to increase coverage (reward). This technique has been applied to problems ranging from elevator control to mobile robot navigation and has been used extensively in RL experiments [34]. Unfortunately, whether Q-learning is instantiated with a table or with a multi-layer neural architecture (such as DQN), it suffers from overestimation of action values in stochastic domains.

This problem is especially pronounced when learning to select mutation operators for fuzzing with our architecture, because its asynchronous nature can make state observations and the arrival of subsequent rewards noisy and potentially highly delayed.

Therefore, we experiment with a Q-learning variant called Double-Q learning, which uses two sets of network weights to reduce overly optimistic value estimates. In our case it appears to learn effectively even though the total training time is relatively short (30,000 seconds). We define the Double-Q learning target following [14], a simple extension of the original DQN implementation designed to maximize computational efficiency:

Y_{t} = R_{t+1} + \gamma \, Q\left(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1}, a; \theta_{t}); \theta_{t}^{-}\right)

where:

R_{t}: reward at time 't'

S_{t}: state at time 't'

\theta_{t}: online network weights at time 't'

\theta_{t}^{-}: target network weights at time 't'

Using this target, the RL network weights are optimized by selecting, at time step 't', the action 'a' that maximizes (argmax) the expected return at the next time step (R_{t+1}).

Note that the formula also accounts for expected future returns, using a discount factor \gamma \in \left[0, 1\right] to determine how much the algorithm should value immediate returns relative to predicted future returns. In our work we benefited from the fact that this method had already been developed independently and shown to be successful in the challenging Atari domain, so we did not need to implement it from scratch [35]. Although we focus on a relatively simple RL configuration with limited training time, we intend it as a starting point for more comprehensive benchmarking in future work.
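For concreteness, the sketch below computes the Double-Q target from [14] in NumPy. Here `q_online` and `q_target` stand for callables backed by the two sets of network weights (\theta_{t} and the periodically copied target weights \theta_{t}^{-}); the batch layout and the default discount factor are illustrative assumptions.

    import numpy as np

    def double_q_targets(rewards, next_states, dones, q_online, q_target, gamma=0.99):
        """Double-Q targets: select actions with the online network,
        evaluate them with the target network (per van Hasselt et al. [14])."""
        best_actions = np.argmax(q_online(next_states), axis=1)   # argmax_a Q(s', a; theta_t)
        next_values = q_target(next_states)[np.arange(len(rewards)), best_actions]
        return rewards + gamma * (1.0 - dones) * next_values      # zero future term at episode end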

B. LSTM RL Network for Selecting Mutators with Memory

As mentioned earlier, to cope with the partial observability of our formulation, we extend the multilayer-perceptron-based network to one that can learn from potentially long sequences of states and actions.

This LSTM layer [16] establishes a separate pathway for the preceding flow of states and actions (memory), enabling the network to remember useful experiences and forget those that are no longer useful.

As shown in Figure 2, the network takes bit arrays as input and produces mutator selections, while the DDQN algorithm (Section IV.A) optimizes the weights to increase return.
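The sketch below captures the network shape described here (bit-array input, a 64-unit LSTM layer, Q-values over the 13 mutators). It is written in PyTorch for concreteness rather than in the Tensorforce configuration we actually used, and every dimension other than the 64-unit LSTM and the 13 actions is an assumption.

    import torch
    import torch.nn as nn

    class MutatorQNetwork(nn.Module):
        """Bit-array observations -> 64-unit LSTM -> one Q-value per mutator."""

        def __init__(self, state_bits=4096 * 8, lstm_units=64, num_mutators=13):
            super().__init__()
            self.lstm = nn.LSTM(input_size=state_bits, hidden_size=lstm_units,
                                batch_first=True)
            self.q_head = nn.Linear(lstm_units, num_mutators)

        def forward(self, obs_seq, hidden=None):
            # obs_seq: (batch, seq_len, state_bits) sequence of bit-array observations
            out, hidden = self.lstm(obs_seq.float(), hidden)
            return self.q_head(out[:, -1]), hidden  # Q-values for the most recent step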

C. Experimental setup and RL agent training

Having defined our network architecture and learning algorithm, we must arrange a training experiment that produces network weights capable of effective action selection. Our goal is to train a fuzzer that selects mutation operators so as to maximize total unique line coverage and thereby obtain the highest return. We observed that, unlike in many typical RL problems, training the fuzzer with short or medium-length episodes tends to produce RL agents that grab low-hanging coverage as quickly as possible. This is undesirable when optimizing for deep code coverage over longer periods, because when each training episode terminates quickly, the chance of breaking through a hard coverage barrier is relatively small.

Effectively covering the low-hanging lines of code when starting from an empty corpus generally requires selecting mutators that grow simple inputs (such as InsertByte and InsertRepeatedBytes), long before mutators that attempt to build complex structure (AddWordFromTORC, CrossOver, AddWordTempAutoDict, etc.) become productive.

For libjpeg, we observed that the first coverage barrier (at 327 lines covered) usually takes 6,000-9,000 seconds of fuzzing (60-90 million test cases of the libjpeg decompression function) to break through. Although the actual coverage obtained after passing this barrier varies, it generally corresponds to whether the fuzzer can create a valid JPEG header. We observed that training an RL agent that never experiences such breakthroughs produces a fuzzer that learns to achieve easy coverage quickly but lacks the ability to reach deep coverage.

In addition to these difficulties, there is considerable time pressure on configuring and setting up fuzzing. Although the ideal approach in terms of performance might be to repeat these very long training episodes over several months, this is impractical in terms of computational requirements.

Moreover, delaying the start of fuzzing experiments for long periods would greatly reduce the usability of the framework. As eloquently described in [46], there is often urgent pressure to start fuzzing as soon as possible. For these reasons we limit training to a very short period of time. For the network evaluated in Section V, we trained for only 3 episodes, each with a relatively long length of 10,000 seconds. This episode length was long enough to break through the 327-line barrier in 2 of the 3 training episodes.

We attribute the success in reaching deeper coverage at least in part to the agent's ability to observe input (state) patterns that are likely to break through barriers, together with the mutations likely to cause them. In conducting this learning experiment, we show that effective learning is possible with a very short total training time (approximately 8 hours, or 30,000 seconds in total).
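The episode schedule described above (3 training episodes of 10,000 seconds each) can be sketched as a simple driver loop over the hypothetical FuzzerGymEnv from Section IV. The agent interface (act, observe, save) is a generic illustration of how Gym-compatible agents are typically driven, not the exact Tensorforce API we used.

    import time

    EPISODES = 3
    EPISODE_SECONDS = 10_000   # each training episode runs for roughly 10,000 s of fuzzing

    def train(agent, env):
        """Drive the asynchronous fuzzing environment for a fixed wall-clock budget."""
        for episode in range(EPISODES):
            obs = env.reset()
            deadline = time.time() + EPISODE_SECONDS
            while time.time() < deadline:
                action = agent.act(obs)                       # choose a mutator
                obs, reward, done, _ = env.step(action)       # asynchronous sample from fuzzer
                agent.observe(reward=reward, terminal=done)   # update replay memory / weights
            agent.save()  # checkpoint weights after each episode (illustrative)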

VII. Threats to Validity

The main goal of this work is to develop a framework that combines existing fuzzing tools with reinforcement learning. Despite our efforts, several areas remain for improvement. In this section we describe some threats to validity, with the aim of encouraging further research and discussion in this area.

A. Unique line coverage as a performance indicator

This work focuses on line coverage as the evaluation metric rather than the number of faults or security vulnerabilities found. Although line coverage pursues the greatest breadth of code execution, there is no guarantee that it leads efficiently to bug discovery. Moreover, exercising rare conditions or branches [22], [44] may be crucial for fault detection, yet our coverage-only metric does not reward this. In short, reaching the dragon's lair (coverage) and slaying the dragon (finding bugs or exercising critical branches) are not the same thing.

The reward function described in this paper is flexible and could encourage bug discovery by rewarding the RL agent each time a fault is detected by a sanitizer (Table 2), rather than rewarding line coverage. Unfortunately, such a modification limits the types of faults that can be detected automatically. Implementing a "fault-centric" RL agent would also require choosing how much reward the RL algorithm receives for each discovered fault.

B. Limitations of intelligent mutation selection

A key goal of our experiments is to determine whether fuzzing performance can be significantly improved by modifying only the mutation operator selection process; that is, to evaluate RL in the context of a simple, controlled experiment. However, it is important to consider that improving mutation selection alone may not be sufficient to achieve thorough testing of a target program. As mentioned earlier, we designed a flexible framework in this area to facilitate further research on other related aspects of fuzzing, such as branch targeting [33], seed selection/filtering [32], or even augmenting the types of mutation actions being performed [22], [23].

C. Expanding the Scale and Scope of the Programs Tested

Our goal was to evaluate our method on a set of relatively well-known software testing benchmarks provided by the libFuzzer test suite [25]. However, analyzing a larger set of more complex functions would reveal further strengths and weaknesses of our method. In particular, how the functionality or domain of the software under test affects the effectiveness of our method remains an open question worth further discussion. For example, why do we perform better on re2/sqlite than on boringssl?

VIII. Conclusion

Inspired by the success of reinforcement learning in challenging and noisy domains, we show that even with limited training time (about 8 hours), more effective mutation operator selection can significantly improve code coverage. Our method was demonstrated on five benchmark programs evaluated under realistic conditions, using far more executions than most existing fuzzing studies.

We demonstrate that, through strategic mutator selection, we can achieve deeper coverage on all tested programs, even when the current state of the art, random selection, is given more opportunities to succeed (Table 4). Although this initial work on learning to select mutators more intelligently has been successful, it is dwarfed by what our architecture makes possible in the future: fuzzing (libFuzzer) combined with learning (OpenAI Gym) under a single multi-platform, multi-language distributed architecture (via gRPC). By using an asynchronous buffer as the coordination mechanism between fuzzing and machine learning, we relieve the fuzzer of the burden of waiting at microsecond latencies and enable future work to fairly evaluate and compare RL methods under different computational constraints.

IX. Acknowledgments

We would like to thank all the reviewers who helped shape the direction of this work. This project is funded by the Test Resource Management Center (TRMC) Test and Evaluation / Science and Technology (T&E/S&T) program through the U.S. Army Program Executive Office for Simulation, Training and Instrumentation (PEO STRI), contract no. W900KK-16-C-0006, Robustness Inside-Out Testing (RIOT). NAVAIR Public Release 2018-165. Distribution Statement A: Approved for public release; distribution is unlimited.

 

 
