【Paper Reading】New Directions in Automated Traffic Analysis

Original title: New Directions in Automated Traffic Analysis
Original authors: Jordan Holland, Paul Schmitt, Nick Feamster, Prateek Mittal
Presentation conference: CCS 2021
Original link: https://dl.acm.org/doi/abs/10.1145/3460120.3484758
Chinese title: A New Direction for Automated Traffic Analysis

1 Motivation

Network traffic analysis currently relies on manual work at each stage: feature selection and representation, model selection, and parameter tuning. To automate these stages, the authors propose nPrintML, a combination of nPrint and AutoML. nPrint defines a representation of raw traffic data: once a pcap file has been processed by nPrint, the output can be used directly as input to a machine learning model. Combined with AutoML, the steps data preprocessing -> model selection -> hyperparameter tuning form a single automated pipeline.

2 Main work of the paper

  • Proposes nPrint, which converts raw traffic data into a standardized representation that can be used directly as input to machine learning models
  • Combines nPrint with AutoML to automate model selection and hyperparameter tuning
  • Evaluates the nPrintML system on several tasks, showing good performance while saving a great deal of manual effort

3 Model implementation

The overall structure of nPrintML is shown in the figure below:
[Figure: overall structure of nPrintML]

3.1 Data preprocessing

In this section, the authors explain why the nPrint representation was proposed and describe its format. They discuss three ways to represent network traffic: semantic representation, naive binary representation, and hybrid representation. A comparison of the three is given first, as shown in the figures below.
[Figures: comparison of the three representation methods]
The three methods are discussed in detail next:

  • Semantic representation. The classic view of network traffic treats a packet as a collection of higher-level headers, such as IP, TCP, and UDP. Each header has semantic fields such as the IP TTL, the TCP port numbers, and the UDP length. A standard semantic representation collects all of these fields into one representation.
  • Naive binary representation. A raw bitmap representation preserves ordering and reduces the reliance on manual feature engineering. This choice yields a consistent, pre-normalized representation, akin to an "image" of each packet.
  • Hybrid representation. This is nPrint. nPrint is a hybrid of the semantic and binary representations: it represents a packet as raw binary data, but aligns that data in a way that recognizes the packet's semantic structure. By using internal padding, nPrint mitigates the misalignment that can occur in unaligned binary representations, while still preserving the order of options. In short, nPrint builds on the binary representation and fills the protocol parts that do not appear in the original traffic with -1, which both preserves the original data and gives every processed packet the same length.

The authors choose nPrint as the representation of raw traffic because the other two options each have a weakness. The semantic representation loses certain fields, such as TCP options. The naive binary representation lacks alignment, and misaligned features lower the performance of downstream learners.
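The padding idea can be illustrated with a minimal Python sketch. This is not the authors' C++ implementation: the slot layout `HEADER_SLOTS`, the helper names, and the toy header bytes are all assumptions made for illustration, and the real tool supports more protocols than shown here.

```python
# Hypothetical, simplified slot layout: (header name, maximum size in bytes).
# The real nPrint tool covers more protocols; these sizes are illustrative.
HEADER_SLOTS = [("ipv4", 60), ("tcp", 60), ("udp", 8)]


def bytes_to_bits(data: bytes, max_bytes: int) -> list[int]:
    """Expand header bytes into bits and pad the unused tail with -1."""
    bits = []
    for byte in data[:max_bytes]:
        # Most significant bit first, as the bits appear on the wire.
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    bits.extend([-1] * (max_bytes * 8 - len(bits)))
    return bits


def nprint_like(headers: dict[str, bytes]) -> list[int]:
    """Concatenate fixed-size slots; a header absent from the packet is all -1."""
    vector = []
    for name, max_bytes in HEADER_SLOTS:
        vector.extend(bytes_to_bits(headers.get(name, b""), max_bytes))
    return vector


# Two toy packets. The header bytes are placeholders, not real captures.
tcp_packet = {"ipv4": bytes(20), "tcp": bytes([0x01, 0xBB]) + bytes(18)}
udp_packet = {"ipv4": bytes(20), "udp": bytes([0x00, 0x35]) + bytes(6)}

tcp_vec = nprint_like(tcp_packet)
udp_vec = nprint_like(udp_packet)
```

In this sketch both vectors have the same length, and the TCP bits always start at the same offset regardless of which other headers are present. A naive bitmap would not guarantee this, since IP options or a different transport protocol shift every later bit.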

The authors implement nPrint in C++ and show experimentally that it has a small memory footprint and a fast processing speed. They have also open sourced the nPrint project; project homepage: nPrint

3.2 Combined with AutoML technology

The authors describe the integration with AutoML as follows:

We use AutoGluon-Tabular to perform feature selection, model search, and hyperparameter optimization on all eight problems we evaluate. We chose AutoGluon because it has been shown to outperform many other public AutoML tools on the same data and because it is open source, although nPrint's well-structured format makes it suitable for use with any AutoML library. While many AutoML tools search for a set of models and corresponding hyperparameters, AutoGluon achieves higher performance by ensembling multiple individual models that perform well. AutoGluon-Tabular allows us to train, optimize, and test more than 50 models per problem, derived from 6 different base model classes, which are variants of tree-based methods, deep neural networks, and neighbor-based classification. The highest performing model for each problem we examine is an ensemble of the base model classes.

AutoGluon has a presets parameter that trades training speed and model size against the overall predictive quality of the trained model. We set the presets parameter to high_quality_fast_inference_only_refit, which yields a model with high prediction accuracy and fast inference. There is also a "best quality" preset that creates models with slightly higher predictive accuracy, at the cost of 10x to 200x slower inference and 10x to 200x higher disk usage. We made this choice because we believe inference time is an important metric in network traffic analysis. We note that a preset of an AutoML tool does not correspond to the training of a single model, but rather to the optimization of a set of models for a given task.

We set no limit on model training time, allowing AutoGluon to find the best model, and we split each dataset into 75% training and 25% testing. Finally, we set the evaluation metric to f1_macro, which is computed by taking the F1 score of each class in a multiclass classification problem and then taking their unweighted average. This decision causes AutoGluon to tune hyperparameters and ensemble weights to optimize the macro F1 score on the validation data.
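To make the f1_macro metric concrete, here is a small pure-Python sketch of the calculation described above: one F1 score per class, then an unweighted mean. The function name and the toy labels are assumptions for illustration; this is not AutoGluon's own code.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): three handshakes from one browser, one from another.
y_true = ["chrome", "chrome", "chrome", "firefox"]
y_pred = ["chrome", "chrome", "firefox", "firefox"]
score = f1_macro(y_true, y_pred)
```

Here the per-class scores are 0.8 for "chrome" and 2/3 for "firefox", so the macro F1 is their plain average, about 0.733. Because every class counts equally, a rare class that is classified poorly lowers the score as much as a common one would.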

nPrintML can be used on both online and offline data. For the specific methods and commands, see the nPrint project homepage. The authors use a heat map to show how much weight the model places on specific fields, which also indicates which fields matter for a given classification task and is very intuitive.
[Figure: heat map of field importance]
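One way to read such a heat map is to add up the importance of the individual bits that belong to each header field. The sketch below assumes that each feature column is named as a field name followed by a bit index (for example `ipv4_ttl_0`); the column names and the importance values are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def importance_by_field(bit_importance: dict[str, float]) -> dict[str, float]:
    """Sum per-bit importance scores into one score per header field."""
    totals = defaultdict(float)
    for column, value in bit_importance.items():
        # Assumed naming scheme: "<field>_<bit index>", e.g. "ipv4_ttl_0".
        field, _, _bit_index = column.rpartition("_")
        totals[field] += value
    return dict(totals)


# Made-up importance scores, for illustration only.
example = {
    "ipv4_ttl_0": 0.10,
    "ipv4_ttl_1": 0.05,
    "tcp_wsize_0": 0.20,
    "tcp_wsize_1": 0.15,
    "udp_len_0": 0.0,
}

# Fields sorted from most to least important.
ranked = sorted(importance_by_field(example).items(),
                key=lambda item: item[1], reverse=True)
```

With these made-up numbers, the hypothetical `tcp_wsize` field ranks first, which is the kind of field-level summary the heat map conveys visually.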

4 Experimental evaluation

The paper evaluates nPrintML on 8 tasks and compares it with manually engineered approaches, as shown in the figure below:
[Figure: nPrintML results on the eight tasks compared with manual approaches]

4.1 Active Device Fingerprinting

In this section, the authors mainly compare the recognition results with Nmap, as shown in the figures below:
[Figures: comparison of recognition results with Nmap]
The experiments also show that nPrint has an advantage in time:
[Figure: time comparison]

4.2 Passive System Fingerprinting

In this section, the authors mainly compare with p0f. The experimental results are shown in the figure below:
[Figure: comparison of results with p0f]

4.3 DTLS Application Identification

In this section, the authors test nPrintML's ability to identify a set of applications from their DTLS handshakes.
The data was collected with the Chrome and Firefox browsers:
[Figure: DTLS data collected with Chrome and Firefox]
The experimental results show that a weighted ensemble classifier trained on the nPrint representation achieves a perfect ROC AUC score, 99.8% accuracy, and a 99.8% F1 score.

nPrintML can almost perfectly identify the (browser, application) pair that generated each handshake. Previous work achieved the same accuracy with manually designed features on an easier version of the problem. nPrintML matches the performance of those hand-crafted features and models on a more difficult problem instance, while avoiding manual model selection and feature engineering entirely.

The paper includes several other experimental tasks; interested readers can refer to the original text.

5 Summary

This paper opens a new direction for automated traffic analysis. It proposes nPrint, a unified packet representation that takes raw network packets as input and transforms them into a format suitable for representation learning and model training. This standard format makes it easy to integrate network traffic analysis with state-of-the-art automated machine learning (AutoML) pipelines. nPrintML, the integration of nPrint and AutoML, automatically learns the best model, parameter settings, and feature representation for a given task. The results show that many network traffic classification tasks can be automated. The authors have also open sourced the project code and released all the datasets used in the experiments.

Origin blog.csdn.net/airenKKK/article/details/124674912