Research on Network Traffic Classification Method

Traditional network traffic classification methods

Port-based

Standard protocols were originally assigned fixed ports: for example, the HTTP service uses port 80 and the SMTP (Simple Mail Transfer Protocol) service uses port 25. By examining the port information in a data packet, the protocol type of the network traffic can be inferred.
Problem: With the emergence of dynamic ports, port masquerading, and non-standard port numbers, the recognition accuracy of this method has dropped significantly.
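
As a minimal sketch of the idea, port-based classification is just a table lookup. The table below is a small illustrative subset of well-known ports, and the function name `classify_by_port` is a hypothetical helper, not part of any library.

```python
# Illustrative subset of well-known port assignments; a real deployment
# would load a full port registry instead of this hand-written table.
WELL_KNOWN_PORTS = {
    25: "SMTP",
    53: "DNS",
    80: "HTTP",
    443: "HTTPS",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Guess the protocol from whichever endpoint uses a well-known port."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "UNKNOWN"


if __name__ == "__main__":
    print(classify_by_port(51514, 80))    # client talking to a web server
    print(classify_by_port(51514, 8080))  # same service on a non-standard port
```

The second call shows the weakness named above: a service moved to a non-standard port is simply not recognized.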

Payload-based

Payload-based methods, such as deep packet inspection, can avoid dynamic port problems to a certain extent by searching for the signature of the application in the payload of the IP packet.
Problems:
-When traffic is encrypted, the application signature is no longer visible in the payload, so this method becomes difficult to apply; inspecting payloads is also computationally expensive
-Because networks evolve rapidly, the protocol signature database must be constantly updated and maintained
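
Signature matching can be sketched roughly as follows. The three patterns are simplified, hand-written examples chosen for illustration (they are assumptions, not an actual DPI signature database), and `classify_by_payload` is a hypothetical helper name.

```python
import re

# Hypothetical, simplified signatures for illustration only; production
# DPI engines maintain far larger and more precise signature databases.
SIGNATURES = [
    ("HTTP", re.compile(rb"^(GET|POST|HEAD|PUT|DELETE|OPTIONS) \S+ HTTP/1\.[01]")),
    ("SSH", re.compile(rb"^SSH-\d\.\d+-")),
    ("TLS", re.compile(rb"^\x16\x03[\x00-\x04]")),  # handshake record header
]


def classify_by_payload(payload: bytes) -> str:
    """Return the name of the first signature matching the payload start."""
    for name, pattern in SIGNATURES:
        if pattern.match(payload):
            return name
    return "UNKNOWN"


if __name__ == "__main__":
    print(classify_by_payload(b"GET /index.html HTTP/1.1\r\nHost: example\r\n\r\n"))
    print(classify_by_payload(b"SSH-2.0-OpenSSH_8.9\r\n"))
```

Note that the TLS pattern only recognizes that a handshake is present; it says nothing about which application is running inside the encrypted session, which is the limitation described in the first problem above.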

Statistics-based

Machine learning methods are often used to classify traffic according to its external statistical characteristics, such as the time between packets, the total number of packets, and the duration of the flow. Commonly used machine learning algorithms are:
-classic machine learning algorithms: random forest, SVM, Naive Bayes, C4.5, etc., as well as various clustering algorithms
-deep learning models: neural networks such as CNN and RNN; denoising autoencoders
Features: highly adaptable to dynamic ports and encrypted traffic; this approach is a current research hotspot in the field of traffic classification.
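
To make the statistical characteristics concrete, here is a small sketch that computes a few of them for one flow. The input format, a time-sorted list of `(timestamp, size)` pairs, and the feature names are assumptions made for this example.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple statistics for one flow.

    packets: non-empty list of (timestamp_seconds, size_bytes), sorted by time.
    """
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival times between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "total_bytes": sum(sizes),
        "duration": times[-1] - times[0],
        "mean_iat": mean(gaps) if gaps else 0.0,
        "std_iat": pstdev(gaps) if gaps else 0.0,
        "mean_size": mean(sizes),
    }


if __name__ == "__main__":
    print(flow_features([(0.0, 100), (0.5, 200), (1.5, 300)]))
```

None of these values depend on the port number or on readable payload content, which is why this family of methods copes with dynamic ports and encryption.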

Traffic classification framework

Supervised learning

Train a traffic classifier on a set of labeled data samples, then use it to assign unidentified traffic to one of the known categories. Reported accuracy can currently reach more than 98%.
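
The train-then-predict workflow can be illustrated with a deliberately simple nearest-centroid classifier written in plain Python. This is a stand-in chosen to keep the sketch self-contained, not one of the algorithms listed earlier, and the two features and class names in the usage example are made up.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples, labels):
    """Average the feature vectors of each class into one centroid."""
    grouped = defaultdict(list)
    for vector, label in zip(samples, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


if __name__ == "__main__":
    # Hypothetical features: (mean packet size in bytes, mean inter-arrival time in s)
    train_x = [[1200, 0.01], [1100, 0.02], [90, 0.5], [110, 0.4]]
    train_y = ["video", "video", "chat", "chat"]
    model = train_centroids(train_x, train_y)
    print(predict(model, [1000, 0.03]))
```

In practice the features would usually be scaled first, since here the packet-size dimension dominates the distance.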

Unsupervised learning

Automatically group a set of unlabeled samples with an unsupervised learning algorithm such as clustering, then label the resulting clusters with the help of tools such as DPI.
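
A rough sketch of this two-step process, assuming plain k-means as the clustering algorithm and a majority vote over whatever labels DPI can supply. Both function names and the `None` convention for samples that DPI could not identify are assumptions of this example.

```python
from collections import Counter
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(centroids[c], p)) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def label_clusters(assignment, dpi_labels):
    """Name each cluster after the majority DPI label among its members.

    dpi_labels[i] is a label string, or None where DPI gave no answer.
    """
    votes = {}
    for cluster, label in zip(assignment, dpi_labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
```

Only a fraction of the samples needs a DPI label; the remaining members of a cluster inherit the label chosen for that cluster.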

Public datasets available in the field

  1. WIDE data set: http://mawi.wide.ad.jp/mawi/
    This data set has been anonymized, contains only the IP header portion, and is unlabeled.

  2. CIC data sets: https://www.unb.ca/cic/datasets/
    These are mainly data sets for intrusion detection and malware.
    The VPN-nonVPN traffic data set (ISCXVPN2016) contains 7 types of encrypted traffic.
    This data set has not undergone any processing.

The above two are commonly used data sets in papers on traffic classification.

Flow data preprocessing methods

1. Take a single flow or a single session as the minimum identification unit
-extract the external statistical characteristics of each flow (such as inter-packet time, number of bytes, etc.) to form a feature vector that represents each flow sample
-extract the first 784 or 1024 bytes of each flow, taken from all layers or from the application layer only; alternatively, extract a fixed number of packets from each flow and a fixed number of bytes from each packet. Either way, each sample is represented as a grayscale image (matrix)

Features: the feature information is rich; both flow statistical features and payload features can effectively represent protocol-specific information.
Disadvantages: feature extraction is time-consuming, complicated, and not easy to implement.
Available feature processing tools: nDPI, nfstream
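
The grayscale-image representation from item 1 can be sketched as below, assuming the 784-byte variant (a 28 x 28 matrix) and zero-padding for flows shorter than that. The padding choice and the helper name `flow_to_image` are assumptions of this example.

```python
def flow_to_image(flow_bytes: bytes, side: int = 28):
    """Truncate or zero-pad to side*side bytes and reshape to a square matrix.

    Each byte becomes one pixel with an intensity in the range 0..255.
    """
    size = side * side
    fixed = flow_bytes[:size].ljust(size, b"\x00")
    return [list(fixed[row * side:(row + 1) * side]) for row in range(side)]


if __name__ == "__main__":
    image = flow_to_image(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
    print(len(image), len(image[0]))
    print(image[0][:4])
```

Passing `side=32` gives the 1024-byte variant.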

2. Take a single data packet as the minimum identification unit
The first 30 or 36 bytes of each packet's application-layer payload form a feature vector that represents one sample.
Commonly used feature representation methods: n-gram or BoW (bag-of-words) model
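
A minimal sketch of the n-gram and bag-of-words representation for a single packet, assuming overlapping byte n-grams over the first 36 payload bytes and a fixed vocabulary chosen in advance. Both helper names are hypothetical.

```python
from collections import Counter


def byte_ngrams(payload: bytes, n: int = 2, prefix_len: int = 36):
    """Count overlapping byte n-grams in the first prefix_len payload bytes."""
    head = payload[:prefix_len]
    return Counter(head[i:i + n] for i in range(len(head) - n + 1))


def to_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed vocabulary (bag-of-words vector)."""
    return [counts.get(gram, 0) for gram in vocabulary]


if __name__ == "__main__":
    counts = byte_ngrams(b"GET /index.html HTTP/1.1\r\n")
    print(to_vector(counts, [b"GE", b"ET", b"TP", b"zz"]))
```

The resulting fixed-length vector is what would be handed to a classifier.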

Current development direction in the field of traffic classification

  1. How to analyze network traffic in real time and classify it accurately in today's rapidly changing network environment

  2. Improve the accuracy and efficiency of identifying and extracting network traffic generated by unknown applications

  3. Identification and analysis of encrypted traffic

Origin blog.csdn.net/cherrychen2019/article/details/111592349