Overview of Encrypted Traffic Column

Encrypted traffic column

1. Principles

  1. Principle: the relationship between sessions, flows, and packets.
  • Flow: in a computer network, all packets that share the same 5-tuple (source IP, source port, destination IP, destination port, protocol) belong to one flow.
  • Reverse flow: the flow with the same protocol whose source and destination IP addresses and ports are swapped.
  • Session: the forward flow and its reverse flow together constitute a bidirectional flow, or session.
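The flow and session definitions above can be sketched in a few lines of Python. This is only an illustration: the packet dictionaries and the field names (`src_ip`, `src_port`, and so on) are assumptions made for the example, not a format used by any particular capture tool.

```python
from collections import defaultdict

def flow_key(pkt):
    """5-tuple that identifies a unidirectional flow."""
    return (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"], pkt["proto"])

def session_key(pkt):
    """Direction-independent key: the two endpoints are sorted, so a flow
    and its reverse flow map to the same session."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (min(a, b), max(a, b), pkt["proto"])

def group_packets(packets):
    """Group packets into flows and into bidirectional sessions."""
    flows, sessions = defaultdict(list), defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
        sessions[session_key(pkt)].append(pkt)
    return flows, sessions

# Hypothetical packets: two from client to server, one reply.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 50000, "dst_ip": "93.184.216.34", "dst_port": 443, "proto": "TCP"},
    {"src_ip": "93.184.216.34", "src_port": 443, "dst_ip": "10.0.0.1", "dst_port": 50000, "proto": "TCP"},
    {"src_ip": "10.0.0.1", "src_port": 50000, "dst_ip": "93.184.216.34", "dst_port": 443, "proto": "TCP"},
]
flows, sessions = group_packets(packets)
```

Here the three packets form two flows (forward and reverse) but a single session.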
  1. Papers :

Review category:

Research category:

2. Models

  1. Papers :

Review category:

  1. A Review of TLS Protocol Malicious Encrypted Traffic Identification Research (Chinese Paper)
  2. Machine Learning for Encrypted Malicious Traffic Detection: Approaches, Datasets and Comparative Study (important)

Research category:

  1. Malicious encrypted traffic detection method combined with multi-feature recognition (Chinese paper)
  2. Research on encrypted malicious traffic detection based on stacking and multi-feature fusion (Chinese paper)
  3. Encrypted Traffic Classification Based on Packet Characteristics (Chinese paper)
  4. MBTree: Detecting Encryption RATs Communication Using Malicious Behavior Tree(CCF-A)

3. Article classification summary

3.1 Research direction

  1. Machine Learning for Encrypted Malicious Traffic Detection: Approaches, Datasets and Comparative Study (important)
  2. Encrypted Traffic Research Direction
  3. Review Papers_Research Review of TLS Protocol Malicious Encrypted Traffic Identification (Chinese Paper)

3.2 Feature Extraction

  1. Research paper_Malicious encrypted traffic detection method combined with multi-feature recognition (Chinese paper_Journal of Information Security)
  2. Research paper_Research on encrypted malicious traffic detection based on stacking and multi-feature fusion (Chinese paper)
  3. Research Paper_Encrypted Traffic Classification Based on Packet Characteristics (Chinese paper)

3.3 Machine learning model and its improvement

3.4 Deep learning model and its improvement

Flow Sequence-Based Anonymity Network Traffic Identification with Residual GCN

  1. The attribute and temporal relationships between flows are used to extract flow sequence features more reasonably and effectively for traffic identification. Assuming that graph convolutional networks (GCNs) suit this purpose, the authors propose a new ResGCN model to identify different web services.
  2. A practical scheme is devised to handle real-world raw traffic. It covers traffic segmentation, generating and enriching traffic features from raw traffic, and LightGBM-based feature combination, which keeps unimportant features from reducing model performance and efficiency.
  3. The framework is evaluated on two real traffic datasets. Experimental results show that the method has good classification performance and is suitable for the identification of different network services.

Summary:
This paper proposes a novel flow-sequence-based framework for network traffic identification. It identifies different anonymity network services with ResGCN by exploiting the attribute and temporal relationships between flows. As an end-to-end, real-time traffic identification method, the framework can handle real traffic: it covers traffic segmentation, generating and enriching traffic features from raw traffic, and LightGBM-based feature combination, which keeps unimportant features from reducing model performance and efficiency. Experimental results show that, thanks to its structural design, the ResGCN classifier achieves higher classification accuracy, lower complexity, and faster classification speed.
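The time-based traffic segmentation mentioned in the summary can be sketched as follows. This is a minimal illustration, assuming each flow record is a `(start_time_seconds, feature_dict)` pair; the record format and the 10-second window are hypothetical choices, not values taken from the paper.

```python
from collections import defaultdict

def segment_by_time(flows, window_seconds):
    """Assign each flow record to a fixed-length time window.

    `flows` is a list of (start_time_seconds, feature_dict) pairs. Window
    indices are counted from the earliest start time in the input.
    """
    if not flows:
        return {}
    t0 = min(start for start, _ in flows)
    segments = defaultdict(list)
    for start, features in sorted(flows, key=lambda f: f[0]):
        index = int((start - t0) // window_seconds)
        segments[index].append(features)
    return dict(segments)

# Hypothetical flow records.
flows = [
    (0.0, {"bytes": 1200}),
    (3.5, {"bytes": 400}),
    (10.2, {"bytes": 90}),
    (19.9, {"bytes": 700}),
    (20.0, {"bytes": 50}),
]
segments = segment_by_time(flows, window_seconds=10)
```

Each segment would then be turned into one flow sequence for the classifier; the window length is one of the parameters the paper tunes.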

  • Paper highlights
  1. Accuracy:
    (1) Various parameters are evaluated, such as the traffic segmentation method (time-based: how many seconds each traffic segment spans) and the number of selected features, to determine the optimal parameters for the model.
    (2) ResGCN is proposed to improve accuracy: graphs are built from two perspectives, time and attributes, so that the model can fully extract the information between flows.
  2. Real-time performance:
    (1) LightGBM is used in feature extraction so that the model extracts features fast enough.
    (2) Because GCN extracts features from flow sequences effectively, ResGCN can classify accurately without a large number of parameters.
  • Paper limitations (ideas for innovation)
  1. The model only applies to classifying normal network application traffic and does not consider malicious traffic. A module that separates normal from abnormal traffic could be added to extend the model.
  2. The robustness of the model needs further improvement, that is, whether it can still classify traffic correctly when attackers deliver carefully crafted traffic.
  • Tools
  1. Traffic analysis tool Tranalyzer2, installation tutorial
  2. Data mining tool Weka, installation tutorial
  • Datasets

SJTU-AN21 dataset:https://github.com/iZRJ/The-SJTU-AN21-Dataset
ISCXVPN2016:https://www.unb.ca/cic/datasets/vpn.html


MAppGraph: Mobile-App Classification on Encrypted Network Traffic using Deep Graph Convolution Neural Networks

Paper Contributions:

  1. A method is developed to process network traffic and generate graphs with node characteristics and edge weights to better represent the communication behavior of mobile applications.
  2. A DGCNN model is developed that is able to learn the traffic behavior of mobile applications from a large number of graphs and achieve fast mobile application classification.
  3. The traffic of 101 mobile applications is collected, and extensive experiments are conducted to demonstrate the effectiveness of MAppGraph.
  4. AppScanner and FlowPrint are enhanced and used as a baseline for performance comparison with MAppGraph.

The method of the paper to solve the above problems:

  1. Only information such as packet size and inter-arrival time is extracted from packet headers, without extracting the packet payload. Statistical features derived from it serve as graph node attributes that represent the traffic behavior between applications and services.
  2. The network traffic of each mobile application is collected at different times, so each application has a large amount of traffic. Each traffic block (e.g., 5 minutes of traffic) forms one communication graph. After enough traffic has been collected, a mobile application has a set of graphs, each representing its traffic behavior at a different moment. The fingerprint of each mobile application is learned or identified from this large set of graphs.

Task of the paper:

Multi-class classification of applications. (This is a graph-level classification problem for graph neural networks.)

Graph construction method:

Graph construction:

Node construction: each node is a (destination IP, destination port) pair. Each node has 63 features.
Edge construction: $C_{i,j} = \sum_{t=1}^{T} a_i(t) \cdot a_j(t)$

In this way, if an application uses multiple services on the same IP (that is, on multiple different ports), the graph contains multiple nodes for that IP, which avoids a single-node star topology.
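The node and edge construction above can be sketched as follows. This is a minimal illustration under stated assumptions: $a_i(t)$ is taken to be 1 when node $i$ exchanges at least one packet in time slice $t$ and 0 otherwise, the packet tuples are hypothetical, and the 63 node features are not computed here.

```python
from itertools import combinations

def build_graph(packets, t_window, t_slice):
    """Build (nodes, edge_weights) for one traffic window.

    packets: list of (timestamp, dst_ip, dst_port), with timestamps measured
    from the start of the window. A node is a (dst_ip, dst_port) pair.
    """
    num_slices = int(t_window // t_slice)
    # activity[node][t] == 1 if the node exchanged any packet in slice t
    activity = {}
    for ts, dst_ip, dst_port in packets:
        t = int(ts // t_slice)
        if 0 <= t < num_slices:
            node = (dst_ip, dst_port)
            activity.setdefault(node, [0] * num_slices)[t] = 1
    nodes = sorted(activity)
    edges = {}
    for i, j in combinations(nodes, 2):
        # C_{i,j} = sum over slices of a_i(t) * a_j(t)
        weight = sum(a * b for a, b in zip(activity[i], activity[j]))
        if weight > 0:
            edges[(i, j)] = weight
    return nodes, edges

# Hypothetical traffic: one IP serving two ports, plus a second IP.
packets = [
    (1, "1.1.1.1", 443), (2, "1.1.1.1", 8080),
    (12, "1.1.1.1", 443), (13, "2.2.2.2", 443),
    (22, "1.1.1.1", 443), (25, "1.1.1.1", 8080), (26, "2.2.2.2", 443),
]
nodes, edges = build_graph(packets, t_window=30, t_slice=10)
```

Because nodes are keyed by (IP, port), the two services on `1.1.1.1` become separate nodes instead of collapsing into one.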

Model:

DGCNN

Summary of the paper:

  1. Methods learned

    Theoretical approach:

    1. A way to construct edge weights. Traffic of an application is collected over a period of time, denoted $T_{windows}$. This period is then divided into $T$ slices, each of length $t_{slice}$. Within each $t_{slice}$, if packets are exchanged at both nodes (that is, both receive and send packets), the weight between the two nodes for that slice is set to 1; otherwise it is set to 0. The edge weight between the two nodes over the whole $T_{windows}$ is the sum of the weights of all $t_{slice}$ periods.
    2. A rough picture of how traffic interaction works in practice:
    • Multiple third-party services may be deployed on the same server (the same IP address) under different port numbers.
    • The same third-party service is usually deployed on multiple servers (different IP addresses) for load balancing.
    3. Using IP addresses alone to construct nodes results in an application-centric star topology, so most graphs have the same structure. To avoid this, graph nodes can be constructed from (destination IP address, destination port) instead of a single IP address.
  2. Paper pros and cons

    Advantages:

    1. The model stores the data collected during real-time prediction to enrich the training set, which mitigates data drift. Data drift means that when an application's traffic behavior changes, its fingerprint changes too. If the model is still trained only on previously collected samples that have not been updated, it cannot recognize the application's new traffic behavior, and accuracy drops significantly.

    Shortcomings:

    1. The paper focuses on identifying traffic of known applications and does not identify unknown applications. This direction is discussed at the end of the paper and is a possible direction for future innovation.
  3. Datasets

    • ReCon [34], an encrypted application traffic dataset
    • Cross Platform [33], an encrypted application traffic dataset
    • ANDRUBIS [23], an encrypted application traffic dataset
    • A dataset collected by the authors
  4. Citations worth reading

    • FLOWPRINT: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic
    • A Network Monitoring System for High Speed Network Traffic (a traffic collection system; worth reading to see whether it can be installed)

EC-GCN: A encrypted traffic classification framework based on multi-scale graph convolution networks

Paper Contributions:

  1. Represent encrypted traffic as a graph and observe a number of distinct features that reveal latent spatial patterns in encrypted traffic.
  2. To understand the spatial dependencies hidden in flows, we creatively introduce GCN into our classification method and propose a new temporal-spatial (multi-subgraph) encrypted traffic classification framework. To the best of our knowledge, this is the first multigraph-based approach for classifying encrypted traffic. To make our method robust to noise and dynamics, we organize the graph into multiple levels of granularity and exploit features at all levels.
  3. In a deep learning framework, we creatively design an encoding layer to automatically convert encrypted traffic into a graph representation. Furthermore, we also propose a novel graphization layer and graph learning layer to dynamically extract multi-graph structures in encrypted traffic.
  4. To evaluate the performance of our proposed model, we conduct a series of experiments on 3 datasets. Experimental results show that compared with the existing methods, the accuracy of EC-GCN algorithm is improved by 5%~20%, and the fault tolerance is stronger.

The method of the paper to solve the above problems:

  1. Represent encrypted traffic as a graph and observe a number of distinct features that reveal latent spatial patterns in encrypted traffic. Use GCN to extract spatial features.
  2. To make the proposed method robust to noise and dynamics, the graph is organized into multiple levels of granularity and features from all levels are exploited.

Task of the paper:

Classify encrypted traffic into specific applications, taking a sequence of packet lengths as input.

Summary of the paper:

  1. Methods learned

    Theoretical approach:

    1. A new way of constructing graphs: each flow is treated as a graph, the node feature is the packet length, and the edge weight is the transition probability between packet lengths.
    2. To improve the real-time performance of the model, a lightweight graph pooling layer is proposed that transforms the graph into a smaller subgraph layer by layer during training. The specific implementation is as follows:
      $H^{(l+1)} = S^{(l)}H^{(l)}$, where $H^{(l+1)} \in R^{n^{l+1} \times F}$
      $W^{(l+1)} = S^{(l)T}W^{(l)}S^{(l)}$, where $W^{(l+1)} \in R^{n^{l+1} \times n^{l+1}}$
    3. A method for learning the edge weight matrix $W$ of the graph structure:

    First define $IR$:
    $IR = (D^{(l)})^{-1}W^{(l)}H^{(l)}$, where $IR \in R^{n^{l} \times F}$ and $D^{(l)}$ is the diagonal matrix of $W^{(l)}$.
    $IR$ represents the interaction score of each node in the graph. The larger the values in a row of $IR$, the more frequently that node (the row) interacts with its neighbor nodes. The calculation of the weight matrix of the current layer $l$ was shown as a figure in the original post (figure not available).
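The graph construction in item 1 and the interaction score in item 3 can be sketched in plain Python. Two assumptions are made here and are not confirmed by the source: each distinct packet length becomes one node, and $D$ is read as the diagonal matrix of row sums of $W$. The pooling step is left out.

```python
def transition_graph(packet_lengths):
    """Nodes are the distinct packet lengths of one flow; W[i][j] is the
    empirical probability that length nodes[i] is immediately followed by
    length nodes[j]."""
    nodes = sorted(set(packet_lengths))
    index = {length: k for k, length in enumerate(nodes)}
    counts = [[0] * len(nodes) for _ in nodes]
    for cur, nxt in zip(packet_lengths, packet_lengths[1:]):
        counts[index[cur]][index[nxt]] += 1
    W = []
    for row in counts:
        total = sum(row)
        W.append([c / total if total else 0.0 for c in row])
    return nodes, W

def interaction_scores(W, H):
    """IR = D^{-1} W H, with D assumed to hold the row sums of W."""
    IR = []
    for w_row in W:
        degree = sum(w_row)
        IR.append([
            sum(w * h_row[f] for w, h_row in zip(w_row, H)) / degree if degree else 0.0
            for f in range(len(H[0]))
        ])
    return IR

# Hypothetical packet length sequence of one flow.
lengths = [100, 1500, 1500, 100, 1500]
nodes, W = transition_graph(lengths)
H = [[float(n)] for n in nodes]   # one feature per node: the packet length
IR = interaction_scores(W, H)
```

With these inputs, `W` is `[[0.0, 1.0], [0.5, 0.5]]`: a 100-byte packet is always followed by a 1500-byte packet, while a 1500-byte packet is followed by either length with equal probability.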

  2. Paper pros and cons

    Advantages:

    1. The real-time performance of the model is considered. A lightweight pooling method reduces the size of the graph layer by layer, and the edge weight matrix is updated at the same time, which reduces the information lost when the graph shrinks.
    2. Only the packet length feature is used, and both temporal and spatial information are extracted from it.

    Shortcomings:

    1. Since only metadata features are used as input, EC-GCN may be affected by traffic shaping operations, such as padding packets to normalize the packet length sequence [36].

    Solution: this problem can be overcome to some extent by integrating more metadata features, including the packet type sequence, the packet interval sequence, and the uplink/downlink sequence.

    2. Algorithms that are robust to protocol changes and traffic obfuscation are still needed.
  3. Datasets

    • OBW30: self-collected dataset
    • HW19: self-collected dataset
    • ISCX-Tor [34]: public dataset

3.5 Other models

MBTree: Detecting Encryption RATs Communication Using Malicious Behavior Tree


Innovative ideas:

  • A detection model should be stable across different environments: stability can be checked through the distributions of the training set and the test set.
  • A detection model should be robust to class imbalance and should not depend on large sample sizes: ideally it can handle imbalanced data or reach high accuracy with only a small number of samples.

Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis


  1. Presents Whisper, a novel malicious traffic detection system that uses frequency-domain analysis and is the first machine learning-based system for real-time and robust detection in high-throughput networks.
  2. Sequence features are extracted through frequency-domain feature analysis, which lays the foundation for Whisper's detection accuracy, robustness, and throughput.
  3. Automatic encoding vector selection is developed for Whisper, which removes the need for manual parameter setting while preserving detection accuracy.
  4. A theoretical analysis framework is developed to demonstrate Whisper's properties.
  5. A Whisper prototype is built with Intel DPDK, and its performance is verified in experiments that replay different types of attack traffic.

Summary 1:

  1. Can detect zero-day attacks: Whisper uses machine learning for detection, which can deal with zero-day attacks effectively.
  2. Accuracy: Whisper extracts and analyzes the sequence features of network traffic through frequency-domain analysis. The extracted features lose little information, which supports accuracy.
  3. Robustness: since the frequency-domain features represent the fine-grained sequential features of the packet sequence and are not disturbed by injected noise packets, Whisper can achieve robust detection.
  4. Real-time detection: the frequency-domain features that Whisper extracts effectively represent the various packet ordering patterns of traffic with low feature redundancy. Low feature redundancy enables high-throughput traffic detection.

Thanks to rich feature representation and lightweight machine learning, Whisper finally achieves real-time detection of malicious traffic in high-throughput networks.

Difficulty: Due to the large scale, complexity, and dynamics of traffic patterns, extracting and analyzing frequency-domain features from traffic is not trivial.
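To make the idea of frequency-domain features concrete, here is a small sketch: per-packet features are collapsed into one scalar with an encoding vector, the scalar sequence is cut into frames, and each frame is transformed with a discrete Fourier transform. This is not the authors' implementation; the feature tuple, the encoding vector `w`, the frame length, and the log compression are all assumptions chosen for illustration.

```python
import cmath
import math

def encode_packets(packets, w):
    """Collapse each per-packet feature vector into one scalar (dot product with w)."""
    return [sum(f * wi for f, wi in zip(pkt, w)) for pkt in packets]

def dft_power(signal):
    """Power of each DFT bin in the non-redundant half of the spectrum."""
    n = len(signal)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        powers.append(abs(coeff) ** 2)
    return powers

def frequency_features(packets, w, frame_len):
    """One log-compressed power spectrum per non-overlapping frame."""
    v = encode_packets(packets, w)
    features = []
    for start in range(0, len(v) - frame_len + 1, frame_len):
        frame = v[start:start + frame_len]
        # log compression keeps very large powers from dominating
        features.append([math.log(p + 1.0) for p in dft_power(frame)])
    return features

# Hypothetical packets: (length in bytes, protocol code, inter-arrival time in s).
packets = [
    (60, 6, 0.001), (1500, 6, 0.002), (60, 6, 0.001), (1500, 6, 0.002),
    (60, 6, 0.500), (60, 6, 0.500), (60, 6, 0.500), (60, 6, 0.500),
]
w = (1.0, 10.0, 100.0)  # hypothetical encoding vector
features = frequency_features(packets, w, frame_len=4)
```

The first frame alternates between two packet sizes, so it has energy in a high-frequency bin; the second frame is constant, so all of its energy sits in the zero-frequency bin. That difference in packet ordering is what the frequency-domain representation captures.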

Summary 2:

  1. Methods learned
  • Theoretical approach:
  1. Features are transformed into the frequency domain and analyzed visually on an RGB map, as described in Section 3.1 of the paper.
  2. When encoding features, the encoding vector $w$ can be optimized with an SMT method, as described in Section 3.2 of the paper.
  3. If the cluster centers are computed directly over all samples, they may be skewed by individual extreme values. Instead, a window can be set: an average is first computed for each window, and the cluster centers are then computed over these averages, which removes part of that influence. This is described in Section 3.3 of the paper.
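The windowed averaging in item 3 can be sketched as follows. The paper's tool list names mlpack for K-means; the plain-Python Lloyd's algorithm below is a stand-in written for this example, and the window size and data are hypothetical.

```python
def window_means(vectors, window):
    """Average consecutive, non-overlapping groups of `window` feature vectors."""
    means = []
    for start in range(0, len(vectors) - window + 1, window):
        chunk = vectors[start:start + window]
        means.append([sum(col) / window for col in zip(*chunk)])
    return means

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Hypothetical 1-D features with one extreme value (9.0).
vectors = [[1.0], [1.2], [0.8], [9.0], [1.0], [1.1], [0.9], [1.0]]
means = window_means(vectors, window=4)
centers = kmeans(means, k=1)
```

The extreme value 9.0 is diluted to a window mean of about 3.0 before clustering, so it pulls the cluster center less than it would as a raw sample.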
  • How to write a paper:
  1. In the experimental evaluation section, which questions should be addressed first? The discussion can be organized around the model's advantages and comparisons with other models.
  2. Paper pros and cons
  • Accuracy is considered (the frequency-domain features represent the fine-grained sequential features of traffic, which gives a deeper understanding of the data).
  • Robustness is considered (the model can detect evasion attacks in which attackers inject various benign traffic to evade detection).
  • Timeliness is considered (the frequency-domain features that Whisper extracts effectively represent the various packet ordering patterns of traffic with low feature redundancy; low feature redundancy enables high-throughput traffic detection).
  3. Tools

DPDK: implementation of the high-speed packet parser
mlpack: K-means clustering implementation
Z3 SMT solver: SMT problem solving

  4. Dataset
    WIDE. Accessed January 2021. MAWI Working Group Traffic Archive. http://mawi.wide.ad.jp/mawi/.

Encrypted Malware Traffic Detection via Graph-based Network Analysis


Paper Contributions:

  1. We propose ST-Graph, a real-time malicious traffic detection framework for encrypted scenarios. ST-Graph reveals malicious behaviors in encrypted networks by exploring and integrating multiple features, and thereby achieves detection with a low false positive rate.
  2. A heterogeneous property graph is designed for encrypted traffic, and a new embedding method, interval-slanted random walk, is proposed to explore and fuse the spatiotemporal features of traffic data.
  3. We evaluate the detection system for a year in several real-world network scenarios and observe promising results. Compared with other works, our detection model achieves higher accuracy (nearly 10 times that of the baseline) and significantly reduces false positives at a tolerable time cost.
  4. In actual deployment, our detection system found malicious cases that other systems could not, and revealed some emerging types of malicious traffic.

Tasks of the paper:

  1. Malware detection: whether a host is infected by malware (binary classification)
  2. Malware family classification (multi-class classification)

Summary of the paper:

  1. Methods learned

    Theoretical approach:

    1. Random walks on the graph can combine the temporal and spatial characteristics of the graph.
    2. After obtaining the host embeddings, weights can also be assigned to them (importance is computed with softmax from information related to each embedding).
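Both ideas can be illustrated with a short sketch. The walk below is a plain uniform random walk, not the paper's interval-slanted variant, and the softmax-weighted sum only shows the general form of importance weighting; the graph, the scores, and all names are hypothetical.

```python
import math
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Uniform random walks over an adjacency-list graph."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

def softmax_weighted_sum(embeddings, scores):
    """Combine embeddings using softmax(scores) as importance weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]

# Hypothetical heterogeneous graph: a host, two flows, and a server.
adjacency = {
    "host_a": ["flow_1", "flow_2"],
    "flow_1": ["host_a", "server_x"],
    "flow_2": ["host_a", "server_x"],
    "server_x": ["flow_1", "flow_2"],
}
walks = random_walks(adjacency, walk_length=5, walks_per_node=2)
combined = softmax_weighted_sum([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```

The walks are node sequences that could be fed to a skip-gram style embedding model. With equal scores, the softmax weights are equal and `combined` is the plain average of the two embeddings.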

    How to write a paper:

    How to write the experimental evaluation:

    • Experimental setup:
      • Implementation: tools used to implement the core method
      • Baselines: baseline models for comparison
      • Environment and parameters: experimental environment and parameter settings
      • Metrics: evaluation metrics
    • Datasets:
      • Introduction to the datasets
      • How the datasets were collected
      • Dataset split (training set, test set)
      • Ethical issues involved in collecting data
    • Detection performance:
      • Comparison of the baseline models and the proposed model under different evaluation metrics on different datasets
      • Generalization ability (ablation experiment)
      • Model robustness check: adding noisy data
    • Real-world assessment

    What follows the experimental evaluation:

    • Discussion: model limitations
  2. Paper pros and cons

    Advantages:

    1. High accuracy: ST-Graph explores multiple features from spatial and temporal perspectives and integrates all available information for comprehensive malware traffic detection in encrypted scenarios.
    2. Good real-time performance: the graph representation is improved by iteratively updating only the edge representations, while the optimal node representations come from closed-form solutions. ST-Graph learns edge embeddings with random walks in a small number of iterations and optimizes host embeddings with a closed-form solution. This greatly reduces the computational complexity of graph representation learning and meets the needs of real-time detection.
    3. Can detect unknown attacks.
    4. Has been applied in practice.

    Shortcomings:

    1. How the elements of the set $I$ (the order of host flows) are represented is not specified.
    2. There are too few baseline models, which makes the comparison slightly less convincing.
    3. Network scale. The size of the internal network affects ST-Graph: more internal hosts lead to more nodes and edges in the graph. The more edges the graph has, the higher the time cost of detection, and the gateway cannot handle an unlimited number of hosts.
  3. Innovative ideas

    Spatial-temporal features could also be extracted with methods other than random walks, such as GCN.

  4. Tools

    • Tshark: extract traffic
    • NetworkX: build heterogeneous graphs
    • Gensim: text representation used to initialize nodes
  5. Datasets

    • Public dataset AndMal2019
    • Private dataset EncMal2021

Accurate Decentralized Application Identification via Encrypted Traffic Analysis Using Graph Neural Networks

Paper Contributions:

  1. A Traffic Interaction Graph (TIG) is proposed to represent each individual encrypted flow, where vertices in the TIG represent packets and edges represent packet-level interactions between a client and a server. Quantitative measures are also provided to demonstrate the advantages of representing flows with TIGs over traditional packet length sequences.
  2. The GraphDApp model is designed, a powerful GNN-based classifier using multi-layer perceptrons (MLPs) and fully connected layers. It maps the TIGs of different DApp flows to different representations in the embedding space without requiring hand-crafted features, which enables efficient and accurate classification.
  3. A real traffic dataset of 1,300 DApps on Ethereum is collected, with more than 169,000 flows. The accuracy and efficiency of GraphDApp are demonstrated in closed and open environments. Compared with state-of-the-art methods, GraphDApp achieves the highest classification accuracy and the shortest training time. It also applies to traditional mobile application classification.

Tasks of the paper:

  1. Closed scenario: multi-class classification of DApps
  2. Open scenario: binary classification of DApps (normal and malicious)

Summary of the paper:

  1. Methods learned

    Theoretical approach:

    1. To show that the proposed graph construction works better than other features, a quantitative measure can be used. Since the task of this paper is DApp fingerprinting, the flows generated by each DApp should be as similar as possible, so graph edit distance is used to measure the similarity of the flows represented by the constructed graphs. (Original paper: III-C; this post: Paper Contribution-2-3)
  2. Paper pros and cons

    Advantages:

    1. The research object is DApp fingerprinting, which has received little study so far.
    2. The graph construction method is worth learning from:
    • Packet direction information: direction is shown by the sign of the vertex in the TIG, where a positive value indicates a downlink packet and a negative value indicates an uplink packet.
    • Packet length information: packet length is a key feature for classifying encrypted traffic.
    • Packet burst information: the vertices in the same layer of the TIG represent the packets that make up a single burst. The burst-level behavior of different applications can vary greatly, so it can serve as a discriminative feature for the classifier.
    • Packet ordering information: the TIG preserves the order of packets from the start of SSL/TLS session negotiation to the end of application data transmission. The TIG also reflects the interaction between server and client.
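A rough sketch of a TIG-like structure built from signed packet lengths is shown below. The sign convention follows the text (positive for downlink, negative for uplink). The edge rules are an assumed reading, not confirmed by the source: consecutive packets inside a burst are linked, and adjacent bursts are linked first-to-first and last-to-last.

```python
def build_tig(signed_lengths):
    """Build (bursts, edges) from packet lengths in arrival order.

    Vertices are packet indices. A burst is a maximal run of packets that
    travel in the same direction; each burst forms one layer.
    """
    bursts = []
    for idx, length in enumerate(signed_lengths):
        same_direction = bursts and (signed_lengths[bursts[-1][-1]] > 0) == (length > 0)
        if same_direction:
            bursts[-1].append(idx)
        else:
            bursts.append([idx])
    edges = []
    for burst in bursts:
        # intra-burst edges: consecutive packets within one burst
        edges.extend(zip(burst, burst[1:]))
    for prev, nxt in zip(bursts, bursts[1:]):
        # inter-burst edges (assumed rule): first to first, last to last
        edges.append((prev[0], nxt[0]))
        if (prev[-1], nxt[-1]) != (prev[0], nxt[0]):
            edges.append((prev[-1], nxt[-1]))
    return bursts, edges

# Hypothetical flow: an uplink request, three downlink packets, two uplink, one downlink.
signed_lengths = [-517, 1460, 1460, 300, -126, -93, 1024]
bursts, edges = build_tig(signed_lengths)
```

For this flow the bursts are `[[0], [1, 2, 3], [4, 5], [6]]`, so direction, length, burst membership, and ordering are all kept in one structure.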

    Shortcomings:

    1. It is not clear how $h_v$ is initialized. Does it include only the packet size (with direction)?
    2. The details of the model are not well described, such as which Readout function is used.
    3. GraphDApp takes a relatively long time to flag unknown flows. Feature extraction time can be shortened by reducing the number of packets in a TIG, and prediction can be accelerated by reducing the number of MLP layers and hidden units.
    4. As a fingerprinting scheme, its accuracy decreases when an application's fingerprint changes. To address this, the application's TIGs can be updated periodically and the classifier's parameters fine-tuned.
  3. Innovative ideas

    • Besides DApps, better models could be tried for fingerprinting other new kinds of applications.
    • When constructing a TIG, consider whether features other than packet length can be used or added (features that help classify encrypted traffic; candidates can come from feature screening or from important features mentioned in other papers).
    • A GCN model could be used to extract an embedding for each flow to be classified, followed by a fully connected layer for classification.
    • Since the subject of this paper is DApps, traffic is captured only in the Chrome browser. It is worth checking whether results differ across browsers.
    • Design the model to adapt to changes in an application's fingerprint, that is, concept drift.
  4. Tools

    • Wireshark
  5. Datasets

    • Private dataset: manually collected

Detecting Unknown Encrypted Malicious Traffic in Real Time via Flow Interaction Graph Analysis

Paper Contributions:

  1. HyperVision is proposed, the first real-time unsupervised system that uses flow interaction graphs to detect encrypted malicious traffic with unknown patterns.
  2. Several algorithms are developed to build the in-memory graph, which accurately captures the interaction patterns between flows.
  3. A lightweight unsupervised graph learning method is designed to detect encrypted traffic through graph features.
  4. A theoretical analysis framework based on information theory is developed to show that the graph captures near-optimal traffic interaction information.
  5. HyperVision is implemented, and its accuracy and efficiency are verified in extensive experiments with various kinds of real-world encrypted malicious traffic.

The method of the paper to solve the above problems:

  1. HyperVision is proposed to achieve unsupervised detection.
  2. An in-memory graph is built to capture the interaction patterns between flows, which makes it possible to detect unknown malicious traffic.
  3. The framework also keeps the ability to detect traditional (known) attacks that use plaintext traffic, which allows generic detection.

Summary of the paper:

  1. Methods learned

    Theoretical approach:

    1. Clustering can be approached from the following angles:

    (1) Clustering similar short flows. Short flows are aggregated when:

    • the flows have the same source address and/or destination address, which means the behaviors generated from these addresses are similar;
    • the flows have the same protocol type;
    • the number of flows is large enough, that is, the number of short flows reaches the threshold AGG_LINE, which ensures that the flows are sufficiently repetitive.

    (2) Clustering similar connected components. Each connected component is represented by the following five features and then clustered with DBSCAN:

    • the number of long flows
    • the number of short flows
    • the number of edges representing short flows
    • the total bytes of all long flows
    • the total bytes of all short flows

    (3) Clustering similar long flows:

    1. Eight and four graph structure features are extracted for the edges associated with short flows and long flows, respectively (the specific features were listed in a figure in the original post).
    2. The features are min-max normalized.
    3. DBSCAN is run with a small search radius $\epsilon$ and a large minimum-points value, to avoid including dissimilar edges in a cluster, which could cause false positives.
    4. An abnormal edge that cannot be clustered is treated as a cluster containing only that edge.

    (4) Cluster the flows using the clustering method (shown as a figure in the original post).
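Steps 2 to 4 of item (3) can be sketched in plain Python. This is a minimal illustration, not the paper's code: the DBSCAN below is a small textbook version written for this example, and the feature rows, `eps`, and `min_pts` are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
    return labels

def singleton_noise(labels):
    """Give every unclustered point its own cluster id (step 4)."""
    next_id = max(labels, default=-1) + 1
    result = []
    for label in labels:
        if label == -1:
            result.append(next_id)
            next_id += 1
        else:
            result.append(label)
    return result

# Hypothetical edge features: two dense groups and one outlier.
rows = [[10, 100], [11, 110], [12, 105], [50, 900], [51, 910], [52, 905], [100, 50]]
labels = singleton_noise(dbscan(min_max_normalize(rows), eps=0.1, min_pts=3))
```

The outlier edge ends up alone in cluster 2 instead of being dropped, so it still reaches the later detection stage.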

    2. Ways to reduce graph size:
    • clustering short flows
    • clustering connected components
    • pre-clustering long flows
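The short-flow aggregation rule can be sketched as follows. The three-pass grouping order (same source and destination, then same source, then same destination) and the threshold value are assumptions made for this example; only the conditions themselves come from the text.

```python
from collections import defaultdict

def aggregate_short_flows(flows, agg_line):
    """Merge short flows that share addresses and protocol into one edge
    once a group holds at least `agg_line` flows; the rest stay separate."""
    remaining = list(flows)
    edges = []
    key_funcs = [
        lambda f: (f["src"], f["dst"], f["proto"]),  # same source and destination
        lambda f: (f["src"], None, f["proto"]),      # same source only
        lambda f: (None, f["dst"], f["proto"]),      # same destination only
    ]
    for key_of in key_funcs:
        groups = defaultdict(list)
        for f in remaining:
            groups[key_of(f)].append(f)
        remaining = []
        for key, members in groups.items():
            if len(members) >= agg_line:
                edges.append({"key": key, "flows": members})
            else:
                remaining.extend(members)
    # flows that never reached the threshold keep one edge each
    edges.extend({"key": (f["src"], f["dst"], f["proto"]), "flows": [f]} for f in remaining)
    return edges

# Hypothetical short flows.
flows = (
    [{"src": "A", "dst": "X", "proto": "TCP"}] * 3
    + [{"src": "A", "dst": d, "proto": "UDP"} for d in ("Y", "Z", "W")]
    + [{"src": "B", "dst": "Q", "proto": "TCP"}]
)
edges = aggregate_short_flows(flows, agg_line=3)
```

Seven short flows collapse into three edges, which is how aggregation keeps the graph small when many short flows repeat.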

    How to write a paper:
    This paper can serve as a model example.

  2. Paper pros and cons

    Advantages:

    1. Considers detection of encrypted traffic in an unsupervised setting
    2. Considers the robustness of the model
    3. Considers the real-time performance of the model
    4. Considers the model's ability to detect unknown encrypted traffic
    5. Considers generality: the model can detect both encrypted and unencrypted traffic
    6. Cites extensive literature, and the experimental work is extremely thorough

    Disadvantages:
    None identified so far.

    Points of confusion:

    1. It is not clear how the SMT optimization used for identifying critical vertices in abnormal interaction detection is modeled.
  3. Innovative ideas

    • Connectivity analysis could also examine other characteristics of each connected component, such as time intervals. If the time interval characteristics of each connected component are tightly concentrated, clustering with DBSCAN on time-related features may work well.
  4. Tools

    • Z3 SMT solver: solves the vertex cover problem to extract critical vertices and minimize the number of clusters
    • NetFlow
    • Zeek
  5. Datasets

  6. Citations worth reading

    • Flowlens: Enabling efficient flow classification for ml-based network security applications
    • Kitsune: An ensemble of autoencoders for online network intrusion detection
    • Deeplog: Anomaly detection and diagnosis from system logs through deep learning

3.7 Real-time detection

3.8 Concept Drift

Tools

Traffic analysis tools

Model building tools

  • Data mining tool Weka, installation tutorial
  • NetworkX: build heterogeneous graphs
  • Gensim: text representation used to initialize nodes
  • mlpack: K-means clustering implementation
  • Z3 SMT solver: SMT problem solving

Datasets

Encrypted Traffic Dataset

Origin blog.csdn.net/Dajian1040556534/article/details/129217246