Research on methods of using deep neural networks for vulnerability mining
Traditional vulnerability mining methods
Static methods
- Rule/template based
- Code similarity detection
- Symbolic execution
Analyze source code; prone to high false-positive rates
Dynamic methods
- Fuzzing
- Taint analysis
Low code coverage
Hybrid methods
Combining static and dynamic methods inherits the problems of both, making it difficult to apply in practice
Introduction
Advantages of machine learning methods
Human intelligence (expert experience) plays an important guiding role (rule-based and feature-based approaches)
Difficulties:
- It is difficult to convert security experts' understanding of vulnerabilities into feature vectors that a detection system can learn
- What the system learns from the feature set may also be affected by various factors,
such as the expressiveness of the model, overfitting, noise in the data, etc.
Some concepts:
- Gap: machine learning needs to learn and understand the semantics of vulnerable code the way humans do (i.e., how to represent vulnerable code as machine-learnable information)
- Semantic gap: the mismatch between human understanding and machine-learnable semantics
- Traditional machine learning methods rely on hand-crafted features
- Deep learning (e.g., RNNs) can automatically learn high-level representations of software code that reveal complex patterns of code semantics
Factors security practitioners consider when manually reviewing code for vulnerabilities
Vulnerability semantic gap
Different expressions of software vulnerabilities
Definition: an instance of a flaw, caused by an error in the design, development, or configuration of software, such that it can be exploited to violate an explicit or implicit security policy
Security-related bugs → exploited by attackers → security failure or violation of a security policy
The semantic gap of human inspection and detection systems
A high-level understanding of code semantics requires:
- Sufficient experience
- Programming knowledge
- Understanding of the programming language itself
- Understanding of the semantics and syntax of the code
- In-depth knowledge of the code base (for example, the Heartbleed vulnerability, which requires an understanding of the n2s function)
- Secure coding practices
Automatic detection system
The feature set it depends on comes from:
- Manual extraction (traditional machine learning algorithms)
- Automatic extraction (deep learning)
Detection systems cannot fully understand the underlying semantics of vulnerable code patterns the way practitioners do. There is therefore a semantic gap between knowledgeable, experienced practitioners and ML-based detection systems, which we define as follows.
Definition of semantic gap:
The semantic gap is the lack of consistency between the abstract semantics of vulnerabilities that practitioners understand and the semantics that ML algorithms can learn.
The accuracy of features matters:
- Features summarized and extracted by humans from experience may lose information or carry personal bias
- Features extracted by machine learning algorithms may suffer from underfitting, overfitting, or data noise
Traditional machine learning methods
Classification
- Based on software metrics
- Based on vulnerable code patterns
- Based on anomaly detection
Software metrics
McCabe: software complexity metrics (e.g., cyclomatic complexity)
Vulnerability detection models using these metrics rest on the assumption that complex code is difficult for a practitioner to understand, and therefore difficult to maintain and test
Code churn: software modification frequency
The code churn metric assumes that frequently modified code is error-prone; since vulnerabilities are a subset of software defects, such code is more likely to be defective or vulnerable
Software metrics can serve as a reference, but not as a decisive factor
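To make the two metric families concrete, here is a minimal sketch (illustrative helpers, not from the surveyed work) that approximates McCabe cyclomatic complexity by counting decision keywords, and computes per-file churn from a list of commit file sets:

```python
import re

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: number of decision points + 1.

    A keyword-counting heuristic for a single function, not a real parser.
    """
    decision_keywords = r"\b(if|elif|for|while|case|and|or|catch|except)\b"
    return len(re.findall(decision_keywords, source)) + 1

def code_churn(commit_logs: list[list[str]]) -> dict[str, int]:
    """Count how often each file path appears across a list of commits."""
    churn: dict[str, int] = {}
    for touched_files in commit_logs:
        for path in touched_files:
            churn[path] = churn.get(path, 0) + 1
    return churn

snippet = """
def check(buf, n):
    if n > len(buf):
        return False
    for b in buf:
        if b == 0:
            return True
    return False
"""
print(cyclomatic_complexity(snippet))  # → 4 (two ifs + one for + 1)
print(code_churn([["a.c", "b.c"], ["a.c"], ["a.c", "c.c"]]))  # a.c churns most
```

In a metric-based detector, such numbers would be assembled into a feature vector per function or file and fed to a classifier; as the note above says, they correlate with defect-proneness but do not decide vulnerability by themselves.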
Patterns extracted from source code
Bag of words → N-gram
These approaches easily ignore context and the rich semantics of code (snippets with exactly the same tokens, each at the same frequency, can still have different semantics)
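The limitation is easy to demonstrate: two statement orderings with identical token multisets, hence identical bag-of-words vectors, but opposite security semantics (check-before-copy vs. copy-before-check). A toy illustration:

```python
from collections import Counter

def bag_of_words(code: str) -> Counter:
    """Token-frequency representation: order and context are discarded."""
    return Counter(code.split())

# Same tokens, same frequencies, different semantics:
safe   = "if ( len < MAX ) copy ( dst , src , len ) ;"
unsafe = "copy ( dst , src , len ) ; if ( len < MAX )"  # bounds check after the copy

assert bag_of_words(safe) == bag_of_words(unsafe)  # indistinguishable to BoW
print(bag_of_words(safe))
```

An n-gram model recovers some local ordering, but any fixed window still misses long-range context such as a check and a use separated by many statements.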
Based on structured information
Use code analysis tools to generate structured data
- AST: abstract syntax tree
- CFG: control flow graph
- DFG: data flow graph
- PDG: program dependence graph
- CPG: code property graph (combines AST, CFG and PDG)
Feature sets built from the output of code analysis tools and parsers reveal more information about the code, because each form of program representation provides a view of the source code from a different aspect
This helps analyze how data flows from sources to sinks, and can be used to track variables
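A minimal sketch of the source-to-sink idea, assuming a toy three-statement program and naive string parsing (the variable and function names are illustrative, not from any surveyed tool):

```python
def reaches_sink(stmts, source, sink):
    """Naive forward taint propagation over 'lhs = rhs' statements.

    A call 'sink(x)' is flagged if any of its arguments is tainted.
    """
    tainted = {source}
    for stmt in stmts:
        if "=" in stmt:
            lhs, rhs = (s.strip() for s in stmt.split("=", 1))
            rhs_vars = set(rhs.replace("(", " ").replace(")", " ").replace("+", " ").split())
            if rhs_vars & tainted:
                tainted.add(lhs)  # taint propagates through assignment
        elif stmt.strip().startswith(sink + "("):
            args = stmt[stmt.index("(") + 1 : stmt.rindex(")")]
            arg_vars = set(a.strip() for a in args.split(","))
            if arg_vars & tainted:
                return True
    return False

prog = [
    "user = read_input()",  # source-derived value
    "buf = user + pad",     # taint flows into buf
    "memcpy(buf)",          # sink receives tainted data
]
print(reaches_sink(prog, "user", "memcpy"))  # → True
```

Real tools compute this over a DFG or PDG rather than raw strings; the point here is only the flow question the graph representations let a detector ask.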
Data traces from dynamic execution
Based on execution anomalies
These methods consider what caused the vulnerability, rather than how the vulnerability is produced
Features are extracted from abnormal API usage patterns, imports and function calls, and missing checks on API symbols for vulnerability detection
Disadvantages: these feature sets are usually tied to a small group of vulnerabilities or to specific vulnerability types
Features extracted from imports and function calls can only train a classifier to detect vulnerabilities introduced by certain fragile header files or libraries,
and API symbols related to missing checks can only be used to discover vulnerabilities caused by missing validation or bounds checks
- Applicable only to specific tasks
- High false-positive rates
Deep learning technology
Advantages
Compared with traditional machine learning, deep learning:
- Can learn higher-level features or representations that are more complex and abstract
- Can automatically learn more generalizable latent features or representations, freeing practitioners from labor-intensive, subjective and error-prone feature engineering
- Provides flexibility, allowing the network structure to be customized for different application scenarios. For example: combining long short-term memory (LSTM) networks with dense layers to learn function-level representations as high-level features, implementing attention layers to learn the importance of features, and adding external memory "slots" to capture long-range code dependencies
These characteristics of deep learning enable researchers to build detection systems that capture code semantics, understand code context dependencies, and automatically learn more generalizable high-level feature sets. With these capabilities, such systems can better "understand" the semantics and context of code, further narrowing the semantic gap
Feature representation method to bridge the semantic gap
NLP (natural language processing) models are also effective for processing source code
Neural models facilitate representation learning
Different types of network structures are used to extract abstract features from various types of inputs; we call these feature representations, and they are used to identify the semantic features of vulnerable code fragments
FCN: fully connected network
MLP: multilayer perceptron
The network can be viewed as a highly non-linear classifier that learns hidden and possibly complex vulnerable patterns.
Traditional ML algorithm:
- Random forest
- Support Vector Machine (SVM)
- C4.5
FCNs can fit highly nonlinear and abstract patterns
FCN advantages:
- On large data sets, FCNs have the potential to learn richer models than traditional ML algorithms; this potential has prompted researchers to use them to model potentially complex vulnerable code patterns
- Input-structure agnostic: the network can take multiple forms of input data (such as images or sequences), which also gives researchers the flexibility to hand-craft various types of features for the network to learn
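To make the "highly non-linear classifier" point concrete, here is a minimal FCN (one hidden layer, hand-written backpropagation, no ML libraries) fitting the XOR function, a pattern no linear model can separate. A didactic sketch, not a vulnerability detector:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# 2-4-1 fully connected network with random initial weights
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [random.uniform(-1, 1) for _ in range(4)]
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, o

def loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data)

lr = 0.5
initial = loss()
for _ in range(3000):                     # plain stochastic gradient descent
    for x, y in data:
        h, o = forward(x)
        d_o = 2 * (o - y) * o * (1 - o)   # gradient at the output unit
        for j in range(4):
            d_h = d_o * W2[j] * h[j] * (1 - h[j])
            W2[j] -= lr * d_o * h[j]
            for i in range(2):
                W1[j][i] -= lr * d_h * x[i]
            b1[j] -= lr * d_h
        b2 -= lr * d_o

print(round(initial, 3), round(loss(), 3))  # loss drops as the FCN fits XOR
```

A linear classifier cannot drive this loss toward zero on XOR; the hidden layer's nonlinearity is what makes the fit possible, which is the same capacity argument the survey makes for vulnerable code patterns.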
CNN Convolutional Neural Network
Used to learn structured spatial data
CNNs can capture the contextual meaning of words, which motivates researchers to apply CNNs to learn context-aware vulnerable code semantics
RNN recurrent neural network
Used to process sequential data
The bidirectional forms of RNNs can capture long-term dependencies in a sequence. Many studies therefore use bidirectional LSTM (Bi-LSTM) and gated recurrent unit (GRU) structures to learn code context dependencies, which are essential for understanding the semantics of many types of vulnerabilities (such as buffer overflows).
Other network types
- Deep belief networks (DBN)
- Variational autoencoders (VAE)
Another promising property of deep learning is that the network structure can be customized to suit different application scenarios
Classification of existing work
- Graph-based feature representations: including AST, CFG, PDG and their combinations
- Sequence-based feature representations: use DNNs to extract feature representations from sequential code entities, such as execution traces, function call sequences, and variable flows/sequences
- Text-based feature representations: learned from source code
- Hybrid feature representations
Rationale:
- The contribution of these studies lies in how software code is processed to generate feature representations, helping the DNN understand code semantics and capture patterns that indicate potentially vulnerable code fragments
- The DNN model works as a classifier with built-in representation learning. Existing research based on different types of feature inputs allows DNNs to obtain high-level representations that reveal different semantic information
Graph-based feature representation
- A method to detect SQL injection and XSS vulnerabilities: CFG/DFG
- For AST-based methods, six studies are surveyed, half of which do not release their data sets or source code (including manually labeled data sets), making them difficult to reproduce
- No study has compared the AST with other forms of graph-based program representation, such as CFG, PDG, or DDG
AST
When extracting the AST from source code, three types of nodes are retained:
- Function call and class instance creation nodes
- Declaration nodes
- Control flow nodes
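As an analogue in Python (the surveyed systems target other languages; the node-type mapping below is illustrative), the standard ast module can retain exactly these three node categories while discarding everything else:

```python
import ast

# The three retained node categories, mapped to Python AST node types:
CALL_NODES    = (ast.Call,)                           # calls / instance creation
DECL_NODES    = (ast.FunctionDef, ast.Assign)         # declarations
CONTROL_NODES = (ast.If, ast.For, ast.While, ast.Try) # control flow

code = """
def fetch(url):
    data = download(url)
    if data:
        for chunk in data:
            process(chunk)
"""

kept = [type(n).__name__ for n in ast.walk(ast.parse(code))
        if isinstance(n, CALL_NODES + DECL_NODES + CONTROL_NODES)]
print(sorted(kept))  # the pruned node set, ignoring traversal order
```

The pruned node set is what gets serialized and fed to the network; identifiers, literals and expression plumbing are dropped as noise.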
ASTs, CFGs, PDGs and DDGs are all graph-based program representations. However, the studies above did not process them in their original tree/graph form, but "flattened" them before feeding them to the deep network. Graph embedding techniques and graph neural networks can serve as alternative, and possibly more effective, ways to perform vulnerability detection on these graph-based program representations.
Sequence-based feature representation
System execution traces, function call sequences, statement sequences forming data flows, etc.
- Static features: the authors extract static features from sets of call sequences associated with standard C library functions, which requires disassembling the binary
- Dynamic features: obtaining dynamic features requires executing the program within a time limit; during execution, the authors monitor program events and collect call sequences
The resulting dynamic call sequences contain a large number of function call parameters, which are low-level computed values
Code gadgets: code sequences that originally captured only data dependencies, later extended to code sequences that also reveal control dependencies
The semantic model and features accurately capture buffer error (CWE-119) and resource management error (CWE-399) vulnerabilities
The code sequence thus forms the context, capturing the "global" semantics related to possible vulnerabilities
Code attention: to detect specific types of vulnerabilities, the authors also proposed so-called "code attention" to focus on "localized" information within statements
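A stripped-down sketch of the mechanism behind "code attention": scaled dot-product weights concentrate on the token most similar to a query vector. The 2-d embeddings and the query here are toy values chosen by hand, not learned parameters:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Scaled dot-product attention over token vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

tokens = ["strcpy", "(", "dst", ",", "src", ")"]
# toy embeddings: the risky API-call token is made similar to the query
embed = {"strcpy": [1.0, 0.0], "(": [0.0, 0.1], "dst": [0.3, 0.5],
         ",": [0.0, 0.1], "src": [0.3, 0.5], ")": [0.0, 0.1]}
vecs = [embed[t] for t in tokens]
query = [1.0, 0.0]

context, weights = attend(query, vecs, vecs)
print(max(zip(weights, tokens)))  # highest weight lands on "strcpy"
```

In a trained model the query and embeddings are learned, so the weights highlight whichever "localized" statement information the vulnerability type makes relevant.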
Text-based feature representation
Code text refers to the surface text of source code, assembly instructions, or source code processed by a lexer
The hierarchical structures of DBNs and FCNs can learn high-level representations
Variants of CNNs and RNNs (such as LSTM networks) can capture contextual patterns or structures from text corpora (such as source code or AST sequences)
Hybrid feature representations
Two types of features are extracted from smali files:
- Token features, represented by the frequencies of Dalvik instructions, capture token attributes
- Semantic features, generated by traversing the AST of the smali file. To extract the token features, the authors divide the Dalvik instructions in smali files into eight categories and build a mapping table
Depth-first search (DFS) traversal is applied to convert the AST into a sequence
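The DFS flattening step can be sketched with Python's own ast module (an analogue; the surveyed work traverses smali ASTs): a pre-order traversal emits each node's type, turning the tree into a sequence a sequence model can consume.

```python
import ast

def ast_to_sequence(tree: ast.AST) -> list[str]:
    """Pre-order (depth-first) traversal flattening an AST into a
    sequence of node-type names."""
    seq = [type(tree).__name__]
    for child in ast.iter_child_nodes(tree):
        seq.extend(ast_to_sequence(child))
    return seq

seq = ast_to_sequence(ast.parse("x = f(1) + 2"))
print(seq)
# → ['Module', 'Assign', 'Name', 'Store', 'BinOp', 'Call',
#    'Name', 'Load', 'Constant', 'Add', 'Constant']
```

The sequence preserves nesting order (a parent always precedes its subtree), which is how tree structure survives the flattening, though sibling boundaries are lost unless extra bracket tokens are emitted.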
Challenges and future directions
Large ground-truth data sets
Data sets are the main obstacle to progress in this field. At present, the proposed neural-network-based vulnerability detection techniques are all evaluated on self-constructed data sets
There is an urgent need for a standard benchmark data set as a unified basis for evaluating and comparing the effectiveness of proposed methods
Code analysis and neural learning
To better learn code semantics and indicate vulnerable code fragments, the network models applied to vulnerability detection are becoming increasingly complex and expressive.
Semantic-preserving neural models
A key point in applying neural networks to vulnerability detection is to bridge the semantic gap by enabling neural models to better exploit the semantics of programming languages
In NLP, the latest developments in sequence modeling and natural language understanding are encouraging, for example the Transformer architecture based on the self-attention mechanism
Code representation learning
In vulnerability detection, vulnerable code patterns are highly varied: the statements forming a vulnerable code context may lie within a function boundary (intra-procedural) or span multiple functions (inter-procedural). Defining a universal feature set that describes all types of vulnerabilities is therefore practically infeasible. Defining feature sets that reflect the characteristics of certain vulnerability types is a reasonable compromise, and detection systems built for specific vulnerability types have achieved promising results
Human intelligibility of the model
ML models, especially neural network models, are black boxes: practitioners cannot tell why a model makes a given prediction or classification. Many of the surveyed studies make no attempt to explain model behavior. In vulnerability detection, not understanding how a model decides that a piece of code is vulnerable or non-vulnerable calls the model's effectiveness into question. One may ask: is the model trustworthy? Is this individual prediction/classification reliable? The inability to understand model behavior may be one of the obstacles hindering the practical application of neural-network-based models to vulnerability detection.