Research on methods of using deep neural networks for vulnerability mining

Traditional vulnerability mining methods

Static methods

  • Rule/template-based
  • Code similarity detection
  • Symbolic execution

Analyzes source code; high false positives

Dynamic methods

  • Fuzzing
  • Taint analysis

Low code coverage

Hybrid methods

Combining static and dynamic methods inherits the problems of both, making it difficult to apply in practice

Introduction

Advantages of machine learning methods

Human intelligence (experience) plays an important guiding role (rule-based methods, methods based on extracted features)

Difficulties:

  1. It is difficult to convert security experts' understanding of vulnerabilities into feature vectors that can be learned by detection systems
  2. What the system learns from the feature set may also be affected by various factors,
    such as the expressiveness of the model, data overfitting, and noise in the data

Some concepts:

  • Gap: machine learning needs to learn and understand the semantics of vulnerable code the way humans do (i.e., how to represent vulnerable code as machine-learnable information)

  • Semantic gap: the gap between the vulnerability semantics practitioners understand and those a model can learn (defined formally below)

  • Traditional machine learning methods rely on hand-crafted features

  • Deep learning (e.g., RNNs) can automatically learn high-level representations of software code that reveal complex patterns in its semantics

Factors security practitioners consider when manually reviewing code for vulnerabilities

Vulnerability semantic gap

Different expressions of software vulnerabilities

Definition: an instance of a flaw, caused by an error in the design, development, or configuration of software, such that it can be exploited to violate some explicit or implicit security policy

Security-related bug → exploited by an attacker → security failure or violation of the security policy

The semantic gap between human inspection and detection systems

A high-level understanding of code semantics requires:

  1. Enough experience
  2. Programming knowledge
  3. Understanding of the programming language itself
  4. Understand the semantics and syntax of the code
  5. Have in-depth knowledge of the code base (an example is the Heartbleed vulnerability, which requires an understanding of the n2s function)
  6. Secure coding practices

Automated detection systems

The feature sets they depend on come from:

  1. Manual extraction (traditional machine learning algorithms)
  2. Automatic extraction (deep learning)

Detection systems cannot fully understand the underlying semantics of vulnerable code patterns the way practitioners do. There is therefore a semantic gap between knowledgeable, experienced practitioners and ML-based detection systems, which we define as follows.

Definition of semantic gap:

Semantic gap is the lack of consistency between the abstract semantics of vulnerabilities that practitioners can understand and the semantics that ML algorithms can learn.

The accuracy of features is important:

  1. Features summarized and extracted by humans based on experience may lose information or carry personal bias
  2. Features extracted by machine learning algorithms may suffer from underfitting, overfitting, or data noise

Traditional machine learning methods

Classification

  1. Based on software metrics
  2. Based on vulnerable code patterns
  3. Based on anomalies

Software metrics

McCabe: software complexity metrics

Vulnerability detection models using these metrics rest on the assumption that complex code is difficult for practitioners to understand, and therefore difficult to maintain and test
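As an illustration, a simplified McCabe (cyclomatic) complexity count can be sketched over Python's `ast` module. The set of decision nodes below is an assumption for demonstration only; real tools use a fuller rule set.

```python
import ast

# Decision-point node types that add a branch (a simplified, assumed rule set).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.BoolOp, ast.ExceptHandler)

def cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points in the given Python source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(n, DECISION_NODES) for n in ast.walk(tree))

code = """
def check(buf, n):
    if n < 0:
        return None
    for i in range(n):
        if buf[i] == 0:
            return i
    return -1
"""
print(cyclomatic_complexity(code))  # 1 base + if + for + if = 4
```

The higher the count, the harder the assumption says the code is to test, and so the more likely it is flagged for review.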

Code Churn: frequency of software modification

The Code Churn metric assumes that frequently modified code is error-prone; since vulnerabilities are a subset of software defects, frequently modified code is more likely to be flawed or vulnerable.
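A minimal sketch of computing churn counts, assuming input shaped like `git log --name-only --pretty=format:` output (file paths per commit, blank lines between commits); the sample paths are purely illustrative.

```python
from collections import Counter

def churn_from_log(log_text: str) -> Counter:
    """Count how many commits touched each file, given git-log-style text:
    one file path per line, blank lines separating commits."""
    churn = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line:
            churn[line] += 1
    return churn

sample = "ssl/t1_lib.c\nssl/ssl.h\n\nssl/t1_lib.c\n\ncrypto/bn/bn.c\n"
print(churn_from_log(sample).most_common(1))  # [('ssl/t1_lib.c', 2)]
```

Files with the highest counts would be ranked as the most churn-prone, and under this metric's assumption, the most likely to harbor defects.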

Software metrics can be used as a reference, but not as a decisive factor

Based on patterns extracted from source code

Bag of words → N-gram

These easily ignore context and the rich semantics of the code (code with exactly the same tokens, each at the same frequency, can have different semantics)
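The limitation is easy to demonstrate: two statements with opposite meanings can have identical bag-of-words features, while n-grams retain at least local token order (a toy sketch).

```python
from collections import Counter

def bag_of_words(code: str) -> Counter:
    # Token frequencies only; all ordering information is discarded.
    return Counter(code.split())

def ngrams(tokens, n=2):
    # Sliding windows of n consecutive tokens preserve local order.
    return list(zip(*(tokens[i:] for i in range(n))))

a = "x = y / z"
b = "x = z / y"   # semantically different: operands swapped
print(bag_of_words(a) == bag_of_words(b))      # True: identical token counts
print(ngrams(a.split()) == ngrams(b.split()))  # False: bigrams differ
```

Even bigrams cannot fully recover semantics, which motivates the structured representations discussed next.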

Based on structured information

Use code analysis tools to generate structured data

  • AST: abstract syntax tree
  • CFG: control flow graph
  • DFG: data flow graph
  • PDG: program dependence graph
  • CPG: code property graph (combines the AST, CFG, and PDG)

Feature sets derived from code analysis tools and parsers reveal more information about the code, because each form of program representation provides a view of the source code from a different aspect

Helps analyze how data flows from source to sink, and can be used to construct features

Based on dynamic execution traces

Based on execution anomalies

Considers what leads to a vulnerability, rather than how vulnerabilities are produced

Features are extracted from anomalous API usage patterns, imports and function calls, and API symbols with missing checks, and used for vulnerability detection

Disadvantage: these feature sets are usually tied to a small group of vulnerabilities or specific vulnerability types

Features extracted from imports and function calls can only train a classifier to detect vulnerabilities introduced by certain vulnerable header files or libraries,
and API symbols related to missing checks can only be used to discover vulnerabilities caused by missing validation or boundary checks

  • Only applicable to specific tasks
  • False positives

Deep learning technology

Advantages

Compared with traditional machine learning, deep learning:

  1. Able to learn higher-level features or representations that are more complex and abstract
  2. Able to automatically learn more generalizable latent features or representations, freeing practitioners from labor-intensive, subjective, and error-prone feature engineering
  3. Provides flexibility, allowing the network structure to be customized for different application scenarios. For example, combining long short-term memory (LSTM) networks with dense layers to learn function-level representations as high-level features, implementing attention layers to learn the importance of features, and adding external memory "slots" to capture long-range code dependencies

These characteristics of deep learning enable researchers to build detection systems that can capture code semantics, understand code context dependencies, and automatically learn more generalizable high-level feature sets. With these capabilities, the resulting systems can better "understand" the semantics and context of the code, further narrowing the semantic gap

Feature representation method to bridge the semantic gap

Natural language models from NLP are also effective for processing source code

Neural models facilitate representation learning

Different types of network structures are used to extract abstract features from various types of input; we call these feature representations, and they are used to identify the semantic features of vulnerable code fragments

FCN: fully connected network

MLP: multilayer perceptron

The network can be viewed as a highly non-linear classifier that learns hidden, possibly complex vulnerable-code patterns.

Traditional ML algorithms:

  • Random forest
  • Support Vector Machine (SVM)
  • C4.5

FCN can fit highly nonlinear and abstract patterns

FCN advantages:

  1. FCNs have the potential to learn richer models than traditional ML algorithms on large datasets. This potential has prompted researchers to use them to model latent and complex vulnerable code patterns
  2. FCNs are input-structure-agnostic: the network can take multiple forms of input data (such as images or sequences), which also gives researchers the flexibility to hand-craft various types of features for the network to learn.
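A minimal NumPy sketch of such a fully connected network used as a non-linear classifier. The weights are random, untrained placeholders, and the 16-dimensional input stands in for a hand-crafted feature vector; a real detector would learn the weights from labeled code.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer fully connected network: one hidden ReLU layer,
    then a sigmoid output interpreted as P(vulnerable)."""
    h = relu(x @ w1 + b1)          # hidden non-linear transformation
    logits = h @ w2 + b2           # linear read-out
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical 16-dimensional feature vector (e.g., metrics and token counts).
x = rng.normal(size=(1, 16))
w1, b1 = rng.normal(size=(16, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
p = mlp_forward(x, w1, b1, w2, b2)[0, 0]
print(0.0 < p < 1.0)  # sigmoid output is a valid probability
```

Stacking such layers is what lets the network fit the highly nonlinear, abstract patterns mentioned above.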

CNN Convolutional Neural Network

Used to learn structured spatial data

CNNs can capture the contextual meaning of words, which motivates researchers to apply them to learn context-aware vulnerable-code semantics
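A minimal sketch of the windowing idea behind this: a one-dimensional convolution mixes each token's feature with its neighbors', which is how a CNN builds context-aware representations. The numeric features and kernel here are toy values, not a trained model.

```python
import numpy as np

def conv1d(seq, kernel):
    """Slide a window of size len(kernel) over a 1-D token-feature sequence;
    each output mixes one token with its neighbors."""
    k = len(kernel)
    return np.array([seq[i:i + k] @ kernel for i in range(len(seq) - k + 1)])

tokens = np.array([1.0, 0.0, 2.0, 3.0, 0.0])  # toy per-token features
kernel = np.array([0.5, 1.0, 0.5])            # 3-token context window
print(conv1d(tokens, kernel))  # [1.5, 3.5, 4.0]: each value sees 3 neighbors
```

In a real model the kernel weights are learned, and many kernels run in parallel over embedded token vectors rather than scalars.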

RNN: recurrent neural network

Used to process sequential data

The bidirectional form of an RNN can capture long-range dependencies in a sequence. Many studies therefore use bidirectional LSTM (Bi-LSTM) and gated recurrent unit (GRU) structures to learn code context dependencies, which are essential for understanding the semantics of many types of vulnerabilities (such as buffer overflow vulnerabilities).
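The bidirectional wiring can be sketched in NumPy with a plain tanh RNN standing in for the LSTM/GRU cells (sizes and weights are illustrative): running the sequence forward and backward and concatenating the two final states gives every token context from both sides.

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_pass(xs, Wx, Wh, b):
    """Run a vanilla tanh RNN over a token-embedding sequence and
    return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def bidirectional_encode(xs, params_f, params_b):
    """Concatenate forward and backward final states, so the code
    representation depends on context on both sides of every token."""
    return np.concatenate([rnn_pass(xs, *params_f),
                           rnn_pass(xs[::-1], *params_b)])

d, h = 4, 3                                   # embedding / hidden sizes (toy)
seq = [rng.normal(size=d) for _ in range(5)]  # 5 embedded code tokens
make = lambda: (rng.normal(size=(h, d)), rng.normal(size=(h, h)), np.zeros(h))
code_vec = bidirectional_encode(seq, make(), make())
print(code_vec.shape)  # (6,): twice the hidden size
```

LSTM/GRU cells replace the tanh update in practice because their gating mitigates vanishing gradients over long code sequences.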

Other network types

  • deep belief network (DBN)
  • variational autoencoders (VAEs)

Another promising feature of deep learning technology is that the network structure can be customized to meet different application scenarios

Classification of existing work

  • Graph-based feature representation: including the AST, CFG, PDG, and their combinations
  • Sequence-based feature representation: uses DNNs to extract feature representations from sequential code entities, such as execution traces, function call sequences, and variable flows/sequences
  • Text-based feature representation: learned from source code
  • Hybrid feature representation

Reasons:

  1. The contribution of these studies lies in how they process software code to generate feature representations, thereby helping DNNs understand code semantics and capture patterns that indicate potentially vulnerable code fragments.
  2. The DNN model works as a classifier with built-in representation learning capabilities. Existing research based on different types of feature input allows DNNs to obtain high-level representations that reveal different semantic information.

Graph-based feature representation

  1. A method to detect SQL injection and XSS vulnerabilities: CFG/DFG
  2. AST-based methods: six projects are introduced, half of which do not release their datasets and source code (including manually labeled datasets), making them difficult to reproduce
  3. No studies have compared the AST with other forms of graph-based program representation, such as the CFG, PDG, or DDG

AST

When extracting the AST from the source code, three types of nodes are retained:

  1. Function call nodes and class instance creation nodes
  2. Declaration nodes
  3. Control flow nodes
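On Python's `ast` module, such node filtering might look like the following sketch. The mapping of Python node types onto the three retained categories is an assumption for illustration; the surveyed work targets other languages.

```python
import ast

# Node categories loosely mirroring the three retained types above
# (calls / instance creation, declarations, control flow) -- an assumed
# mapping onto Python's AST for demonstration purposes.
RETAINED = {
    ast.Call: "call",
    ast.FunctionDef: "declaration",
    ast.Assign: "declaration",
    ast.If: "control",
    ast.For: "control",
    ast.While: "control",
}

def retained_nodes(source: str):
    """Walk the AST and keep only the category labels of retained nodes."""
    tree = ast.parse(source)
    return [RETAINED[type(n)] for n in ast.walk(tree) if type(n) in RETAINED]

src = "def f(n):\n    buf = alloc(n)\n    if n > 0:\n        fill(buf)\n"
print(sorted(retained_nodes(src)))
```

Discarding the remaining node types shrinks the tree while keeping the structure that vulnerability patterns are assumed to depend on.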

In addition, ASTs, CFGs, PDGs, and DDGs are graph-based program representations. However, the above research did not process them in their original tree/graph form, but "flattened" them before feeding them to the deep network. Graph embedding techniques and graph-based neural networks could serve as alternative, and possibly more effective, solutions for vulnerability detection with these graph-based program representations.

Sequence-based feature representation

System execution traces, function call sequences, statement sequences forming data flows, etc.

  • Static features: the authors extract static features from sets of call sequences associated with standard C library functions, which requires disassembling the binary
  • Dynamic features: obtaining dynamic features requires executing the program within a limited time; during execution, the authors monitor program events and collect call sequences.

The resulting dynamic call sequences contain a large number of function call parameters, which are low-level computed values

Code gadgets: code sequences that initially captured only data dependencies, and were later extended to also capture control dependencies.
The semantic model and features accurately capture buffer error (CWE-119) and resource management error (CWE-399) vulnerabilities.
The code sequence thus forms a context, capturing the "global" semantics related to possible vulnerabilities.

Code attention: to detect specific types of vulnerabilities, they also proposed so-called "code attention" to focus on "localized" information within statements

Text-based feature representation

Code text refers to the surface text of source code, assembly instructions, and source code processed by a code lexer.

The hierarchical structure of DBN and FCN can learn high-level representation.

Variants of CNN and RNN (such as LSTM networks) are able to capture contextual patterns or structures from text corpora (such as source code or AST sequences)

Hybrid feature representation

Two types of features are extracted from smali files:

  1. Token features, represented by the frequencies of Dalvik instructions, reflecting token attributes
  2. Semantic features, generated by traversing the AST of the smali file. To extract the token features, the authors divide the Dalvik instructions in smali files into eight categories and build a mapping table

Apply depth-first search (DFS) traversal to convert the AST into a sequence
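The flattening step can be sketched on Python's own AST (the original work traverses smali-file ASTs; this sketch only demonstrates the pre-order DFS itself).

```python
import ast

def ast_to_sequence(source: str):
    """Flatten an AST into a token sequence via depth-first (pre-order)
    traversal, emitting one node-type name per step."""
    def dfs(node):
        yield type(node).__name__          # visit the node first...
        for child in ast.iter_child_nodes(node):
            yield from dfs(child)          # ...then its children, left to right
    return list(dfs(ast.parse(source)))

print(ast_to_sequence("x = f(1)"))
# ['Module', 'Assign', 'Name', 'Store', 'Call', 'Name', 'Load', 'Constant']
```

The resulting sequence can then be embedded and fed to any of the sequence models discussed above.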

Challenges and future directions

Large ground-truth datasets

Datasets are the main obstacle to progress in this field. At present, the proposed neural-network-based vulnerability detection techniques are all evaluated on self-constructed datasets

There is an urgent need for a standard benchmark data set as a unified indicator for evaluating and comparing the effectiveness of the proposed methods.

Code analysis and neural learning

Network models applied to vulnerability detection are becoming more and more expressive, in order to better learn code semantics and identify vulnerable code fragments.

Semantic Preserving Neural Model

In the field of applying neural networks for vulnerability detection, a key point is to fill the semantic gap by enabling neural models to better use the semantic inference of programming languages

In the field of NLP, the latest developments in sequence modeling and natural language understanding are encouraging, for example the Transformer, based on the self-attention mechanism

Code representation learning

In vulnerability detection, vulnerable code patterns are diverse, and the statements forming a vulnerable code context may lie within a function boundary (intra-procedural) or span multiple functions (inter-procedural). Defining a universal feature set that describes all types of vulnerabilities is therefore almost infeasible. Defining feature sets that reflect the characteristics of certain vulnerability types is a compromise choice, but detection systems developed for specific vulnerability types have achieved promising results

Human intelligibility of the model

ML models, especially neural network models, are black boxes: the reasons behind a model's predictions/classifications are unknown to practitioners. Most of the reviewed studies did not try to explain model behavior. In vulnerability detection, being unable to understand how a model decides that a piece of code is vulnerable/non-vulnerable can make the model's effectiveness questionable. People may ask: is the model trustworthy? Is this individual prediction/classification reliable? The inability to understand model behavior may be one of the obstacles hindering the practical application of neural-network-based models to vulnerability detection.

Origin blog.csdn.net/cherrychen2019/article/details/111588410