[GNN+Encrypted Traffic C] An Encrypted Traffic Classification Method Combining GCN and Autoencoder

Introduction to the paper

Original title : An Encrypted Traffic Classification Method Combining Graph Convolutional Network and Autoencoder
Publication conference : IPCCC
Publication date : 2020-11-06
First author : Boyu Sun
BibTeX citation :

@inproceedings{sun2020encrypted,
  title={An encrypted traffic classification method combining graph convolutional network and autoencoder},
  author={Sun, Boyu and Yang, Wenyuan and Yan, Mengqi and Wu, Dehao and Zhu, Yuesheng and Bai, Zhiqiang},
  booktitle={2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)},
  pages={1--8},
  year={2020},
  organization={IEEE}
}

Summary

The increase in the sources and scale of encrypted network traffic creates significant challenges for network traffic analysis. How to obtain high classification accuracy with a small number of labeled samples is a challenge facing the field of encrypted traffic classification. To solve this problem, this paper proposes a new encrypted traffic classification method that learns feature representations from traffic structure and flow data. We construct a k-nearest neighbor (KNN) traffic graph to represent the structure of traffic data, which contains more traffic similarity information. We utilize a two-layer graph convolutional network (GCN) architecture for flow feature extraction and encrypted flow classification.

We further use an autoencoder to learn a representation of the flow data itself and integrate it with the GCN-learned representation to form a more complete feature representation.

This method takes advantage of GCN and autoencoders and can achieve higher classification performance with very little labeled data. Experimental results on two public datasets show that our method achieves impressive results compared to state-of-the-art competitors.

Problems

Whether ML-based or DL-based, existing methods usually focus only on the characteristics of the traffic data itself and rarely consider the structure of the traffic data when classifying traffic.

Paper contributions

  1. A new encrypted traffic classification method is proposed. This method combines the advantages of GCN and autoencoders and can achieve higher classification performance with very little labeled data.
  2. A KNN traffic graph is proposed to represent the structure of traffic, which addresses the problem that traditional traffic graphs contain little similarity information.
  3. Excellent encrypted traffic classification results are achieved on real network traffic data, outperforming several state-of-the-art methods.

The paper’s approach to solving the above problems:

A KNN traffic graph is constructed to capture the structural similarity between flows, and a classification model combining a two-layer GCN with an autoencoder is trained on it, so that good accuracy can be achieved with very few labeled samples.

Task type:

Node classification (each flow is a node of the traffic graph)

Workflow

  1. Data preprocessing
  • Traffic segmentation:

    1. Remove the first 24 bytes of the pcap file (the global header). These 24 bytes contain only metadata about the pcap file and are not helpful for traffic classification.
    2. Split the raw traffic into flow units according to the 5-tuple (source IP, destination IP, source port, destination port, transport protocol), giving a flow set F = {f1, f2, …, fn}, where each flow fi = {pi1, pi2, …, piq} is a set of packets.
  • Traffic purification:

    1. Remove the data link layer header of each packet p, since it mainly contains the two MAC addresses, which are not useful features for traffic classification.
    2. Zero out the 5-tuple fields to anonymize them, as this information may bias the feature extraction process.
    3. Remove all duplicate and empty flow files.
  • Uniform length:

    Process all flows to a uniform length: flows longer than 900 bytes are truncated to 900 bytes, and flows shorter than 900 bytes are padded with 0x00 at the end to reach 900 bytes. If the truncation length is too large, the model has too many input parameters and training becomes more complex; if it is too small, classification accuracy may drop, because the truncated content may contain discriminative flow features. Therefore, the truncation length is set to 900.

  • Data normalization:

    Because each byte can be interpreted as an integer in the range [0, 255], the 900-byte flow sequence is converted into a 900-dimensional vector, which is then normalized to the range [0, 1]. This improves the accuracy and convergence speed of classification model training (a minimal preprocessing sketch is given at the end of this section).

  2. Traffic graph construction

    After data preprocessing, a KNN graph is established as a traffic graph to represent the structure of traffic data.

    First, compute the flow similarity matrix S between flow vectors:

    [Equation in figure: definition of the flow similarity matrix S]
    After calculating the similarity matrix S, the top-k most similar flows of each flow are selected as its neighbors to construct an undirected k-nearest-neighbor graph (see the KNN-graph sketch at the end of this section).

    [Figure: construction of the undirected KNN traffic graph]

  3. Classification model training

    [Figure: architecture of the classification model combining GCN and autoencoder]

    • Role of the GCN: extract structural features from the traffic graph.
    • Role of the autoencoder: learn a representation of the flow data itself; its hidden-layer representation is fed into the corresponding GCN hidden layer, which also helps prevent over-smoothing (see the model sketch at the end of this section).

    Hidden layer representation:
    [Equation in figure: hidden-layer representation combining the GCN and autoencoder outputs, weighted by φ]

    Loss function:
    [Equation in figure: training loss function]
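
The following is a minimal sketch of the uniform-length and normalization preprocessing steps described above, assuming each flow is already available as a raw byte string; the function name and the use of NumPy are illustrative choices, not taken from the paper.

import numpy as np

FLOW_LEN = 900  # truncation/padding length chosen in the paper

def flow_to_vector(raw_bytes: bytes) -> np.ndarray:
    """Truncate or zero-pad a flow to 900 bytes, then scale each byte to [0, 1]."""
    trimmed = raw_bytes[:FLOW_LEN]             # trim flows longer than 900 bytes
    padded = trimmed.ljust(FLOW_LEN, b"\x00")  # pad shorter flows with 0x00
    vec = np.frombuffer(padded, dtype=np.uint8).astype(np.float32)
    return vec / 255.0                         # bytes in [0, 255] -> floats in [0, 1]

# Example: a toy 4-byte "flow" becomes a 900-dimensional normalized vector.
x = flow_to_vector(b"\x10\x20\x30\x40")
print(x.shape, x.min(), x.max())               # (900,) 0.0 ~0.25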
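
Next is a small sketch of how the KNN traffic graph could be built from the preprocessed flow vectors. The paper computes a similarity matrix S and keeps each flow's top-k most similar flows as neighbors; cosine similarity is used here as an assumed similarity measure, since the exact definition of S appears only in the figure above.

import numpy as np

def build_knn_graph(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Build an undirected KNN adjacency matrix from flow vectors X (n_flows x 900)."""
    # Cosine similarity matrix S (assumed similarity measure).
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)               # exclude self-similarity
    n = X.shape[0]
    A = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        neighbors = np.argsort(S[i])[-k:]      # indices of the top-k most similar flows
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)                  # symmetrize to get an undirected graph

A = build_knn_graph(np.random.rand(20, 900).astype(np.float32), k=5)
print(A.shape, A.sum(axis=1)[:5])              # each node has at least k neighbors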
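
Finally, the architecture, hidden-layer representation, and loss function above are only shown as figures, but the surrounding text (the autoencoder's hidden representation is injected into the GCN hidden layer, and φ controls the contribution of the two representations, with φ = 0.5 working best) suggests a weighted fusion of the two hidden representations. The PyTorch sketch below illustrates that idea under those assumptions; the layer sizes, number of classes, and exact fusion rule are illustrative rather than the paper's definitive formulation.

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: A_hat @ H @ W, with A_hat = D^-1/2 (A + I) D^-1/2."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return A_hat @ self.linear(H)

class GCNWithAutoencoder(nn.Module):
    """Two-layer GCN whose hidden input mixes GCN and autoencoder representations."""
    def __init__(self, in_dim: int = 900, hid_dim: int = 128, n_classes: int = 12, phi: float = 0.5):
        super().__init__()
        self.phi = phi                                # weight of the autoencoder representation
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.decoder = nn.Linear(hid_dim, in_dim)     # reconstruction head of the autoencoder
        self.gcn1 = GCNLayer(in_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, n_classes)

    def forward(self, A_hat: torch.Tensor, X: torch.Tensor):
        Z = self.encoder(X)                           # autoencoder hidden representation
        X_rec = self.decoder(Z)                       # reconstruction, used for the AE loss
        H = torch.relu(self.gcn1(A_hat, X))           # GCN hidden representation
        H_mix = (1 - self.phi) * H + self.phi * Z     # assumed fusion controlled by phi
        logits = self.gcn2(A_hat, H_mix)              # per-node (per-flow) class scores
        return logits, X_rec

Training would then combine a classification loss on the small labeled subset of flows with a reconstruction loss for the autoencoder, which is consistent with the semi-supervised setting described in the summary; the exact loss used in the paper appears only in the figure above.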

Experiments

  • Comparative experiment:

    [Table: comparison with state-of-the-art methods on both datasets]

  • Sensitivity analysis:

    Analysis of the number of nearest neighbors K: this experiment tests the impact of the number of nearest neighbors K on the construction of the KNN graph (the traffic structure graph). We set K = {1, 3, 5, 7, 10} and measure the overall accuracy on both datasets. The experimental results in Table V show that the model achieves better accuracy when K = 3, 5, or 7, while the performance drops significantly when K = 1 or K = 10. When K = 1, the KNN graph contains too little structural information, so the GCN module cannot effectively capture the traffic structure; when K = 10, communities in the KNN graph may overlap.
    [Table V: overall accuracy for different numbers of nearest neighbors K]

    Analysis of parameter φ: the parameter φ controls the contribution of the representations learned by the autoencoder and the GCN. We set different values of φ and measure the overall accuracy on both datasets. Table VI shows that the best accuracy occurs at φ = 0.5, which indicates that the representations of the GCN and autoencoder modules are equally important for the classification model. When φ approaches 0 or 1, the classification accuracy drops significantly, which shows that neither the GCN representation alone nor the autoencoder representation alone can achieve high accuracy.
    [Table VI: overall accuracy for different values of φ]

Datasets

  • ISCX VPN-nonVPN
  • USTC-TFC2016
  • HIKARI-2021: a network intrusion detection dataset generated from real and encrypted synthetic attack traffic


Source: blog.csdn.net/Dajian1040556534/article/details/132822679