[Paper Sharing] Heterogeneous Graph Malicious Domain Detection Method: HGDom: Heterogeneous Graph Convolutional Networks for Malicious Domain Detection

  • Title: HGDom: Heterogeneous Graph Convolutional Networks for Malicious Domain Detection
  • Link: https://ieeexplore.ieee.org/document/9110462
  • Source:-
  • Conference: NOMS 2020
  • Time: 2020-04
  • Institution: Tsinghua University
  • Abstract: This paper presents HGDom, a malicious domain detection system based on a heterogeneous graph convolutional network. First, the characteristics of domains and the complex relationships among domains, clients, and IP addresses are analyzed, and a Heterogeneous Information Network (HIN) is introduced to model the DNS scenario. Then, a new representation learning method, MAGCN, is proposed. It adopts a meta-path-based attention mechanism that handles both node features and graph structure in the HIN.
  • Others: This paper has many similarities with DeepDom: https://blog.csdn.net/qq_39328436/article/details/124124256

Introduction

Motivation

  • The character distribution of domain names, the query behavior of clients, and the resource aggregation of attackers can all be used for malicious domain detection. To take all three intuitions into account, HGDom uses a HIN model to represent clients, domains, IP addresses, and the different relationships among them.

Contributions

  1. A malicious domain detection system, HGDom, is proposed. HGDom models the DNS scenario as a HIN and adopts a heterogeneous GCN method, so that it makes full use of the DNS information and accurately discovers malicious domains.
  2. A new deep learning method, MAGCN, is designed. It adopts a meta-path-based attention mechanism that jointly processes node features and structural information in the HIN.
  3. A prototype of HGDom is implemented, and the effectiveness and superiority of the proposed method are demonstrated by extensive experiments on two real datasets collected from TUNET and CERNET2.

Method

A. Data Collection

In order to obtain more representative information reflecting the actual situation of the network, this paper conducts passive data collection, mainly using three kinds of data:

  1. DNS traffic: It contains multiple fields such as src, rcode, and TTL, which reflect in detail the communication between clients, resolvers, and upper-level DNS servers.
  2. passive DNS (pDNS) dataset
  3. DNS log: The DNS server records domain queries in the log, including time, domain name, source IP, etc.

B. HIN Model Construction

Three node types:

  • client
  • domain name
  • IP address

Four relationship types:

  • Request: The client requests the domain name
  • Mapping: The domain name is mapped to an IP address
  • Segment: Two clients belong to the same network segment
  • Alias: Domain A is the CNAME of Domain B
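
As a minimal sketch of how such a network could be stored, the snippet below keeps one adjacency map per relation type. The function name `build_hin`, the triple format, and the sample records are hypothetical and chosen only for illustration; the paper's actual data structures are not described in this post.

```python
from collections import defaultdict

# The four relation types listed above.
RELATIONS = ("request", "mapping", "segment", "alias")

def build_hin(edges):
    """Group (relation, source, target) triples into one adjacency map per relation."""
    hin = {rel: defaultdict(set) for rel in RELATIONS}
    for rel, src, dst in edges:
        if rel not in hin:
            raise ValueError(f"unknown relation: {rel}")
        hin[rel][src].add(dst)
    return hin

# Hypothetical observations (documentation-only names and addresses).
edges = [
    ("request", "10.0.0.1", "a.example"),
    ("request", "10.0.0.2", "a.example"),
    ("request", "10.0.0.2", "b.example"),
    ("mapping", "a.example", "203.0.113.7"),
    ("mapping", "b.example", "203.0.113.7"),
    ("alias", "b.example", "cdn.example"),
]
hin = build_hin(edges)
print(sorted(hin["request"]["10.0.0.2"]))
```

Segment edges would be stored in the same way once the network segment of each endpoint is known.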

For the features of domain nodes, one-hot encoding is applied directly to the name string to capture the character distribution of the domain name. To improve performance and efficiency, the HIN is pruned in a preprocessing step according to the following conservative rules. The removed nodes are either unlikely to be malicious or contribute little to information propagation.

  • Inactive clients: Clients that issue too few queries to provide useful information
  • Large clients: Clients that query more than Kc% of the domains (for example, Kc = 90) are mostly forwarders or proxies and are removed to reduce noise.
  • Irregular domains: Domains whose names do not conform to the naming rules (RFC 1035), and domains queried by only one client, are removed because they carry no useful information.
  • Popular domains: Domains queried by more than Kq% of the clients (e.g., Kq = 50) are removed because they are very unlikely to be malicious; if they were, the result would be a major attack incident that an IDS would easily detect.
  • Rare IPs: IP addresses that map to only one domain are discarded as they do not help with label propagation.
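
The pruning rules above can be sketched as follows. This is only an illustration under several assumptions: "large clients" is read as clients that query more than Kc% of all domains, clients are pruned before domains, the inactivity threshold `min_domains` is a hypothetical parameter, and `is_regular` is a simplified check of the RFC 1035 preferred label syntax rather than the paper's exact rule.

```python
import re

# Simplified RFC 1035 label: starts with a letter, ends with a letter or digit,
# letters/digits/hyphens in between, at most 63 characters (assumed rule).
LABEL = re.compile(r"^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_regular(domain):
    labels = domain.lower().rstrip(".").split(".")
    return len(domain) <= 253 and all(LABEL.match(label) for label in labels)

def prune(queries, mappings, kc=90.0, kq=50.0, min_domains=2):
    """queries: {client: set of domains}; mappings: {domain: set of IPs}."""
    n_domains = len({d for ds in queries.values() for d in ds})
    # Drop inactive clients and likely forwarders/proxies.
    clients = {c for c, ds in queries.items()
               if len(ds) >= min_domains and 100.0 * len(ds) / n_domains <= kc}
    # Count, over the retained clients, who queried each domain.
    clients_per_domain = {}
    for c in clients:
        for d in queries[c]:
            clients_per_domain.setdefault(d, set()).add(c)
    # Drop irregular, single-client, and popular domains.
    domains = {d for d, cs in clients_per_domain.items()
               if is_regular(d) and len(cs) > 1
               and 100.0 * len(cs) / len(clients) <= kq}
    # Keep only IPs shared by at least two retained domains.
    domains_per_ip = {}
    for d in domains:
        for ip in mappings.get(d, ()):
            domains_per_ip.setdefault(ip, set()).add(d)
    ips = {ip for ip, ds in domains_per_ip.items() if len(ds) > 1}
    return clients, domains, ips

# Hypothetical toy data.
all_names = {"popular.example", "evil-a.example", "evil-b.example",
             "bad_name!.example", "lonely.example"}
queries = {
    "c1": {"popular.example", "evil-a.example", "bad_name!.example"},
    "c2": {"popular.example", "evil-a.example", "evil-b.example"},
    "c3": {"popular.example", "evil-b.example"},
    "c4": {"popular.example"},                    # inactive
    "c5": {"popular.example", "lonely.example"},
    "c6": {"popular.example", "bad_name!.example"},
    "proxy": set(all_names),                      # queries every domain
}
mappings = {
    "evil-a.example": {"203.0.113.7"},
    "evil-b.example": {"203.0.113.7", "198.51.100.9"},
    "popular.example": {"192.0.2.1"},
}
clients, domains, ips = prune(queries, mappings)
print(sorted(clients), sorted(domains), sorted(ips))
```

In this toy run the proxy and the inactive client are dropped first, so the per-domain statistics are computed on the remaining five clients only.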

C. Meta-path Generation

  • PID1: A domain tends to belong to the same category as its CNAME domain.
  • PID2: The set of malicious domains queried by victims of the same attacker may partially overlap, while benign clients have no reason to query them.
  • PID3: Domains that resolve to the same IP address over a period of time tend to belong to the same category.
  • PID4: Adjacent clients are vulnerable to the same attacker.
  • PID5: Attackers tend to reuse their domain or IP resources due to financial constraints.

D. Proposed method: MAGCN

To fully reflect the observations in DNS data, this paper proposes a heterogeneous GCN method: MAGCN, which consists of two stages:

  • Subgraph extraction: converts the HIN into a set of homogeneous networks on which convolution operations can be performed.
  • Attention-based aggregation: learns the final representation by aggregating the subgraphs according to their importance.

Subgraph extraction

This step is the main difference between DeepDom and HGDom. DeepDom selects the neighbor nodes whose features are aggregated by meta-path-guided random walks, whereas HGDom first converts the heterogeneous graph into homogeneous graphs and then aggregates over them.

  • Following the five meta-paths above, the heterogeneous graph is decomposed into a set of homogeneous graphs (there are five meta-paths, so five subgraphs can be generated in advance). The adjacency matrix A'i of each subgraph is computed from the commuting matrix C of the corresponding meta-path in the original HIN; the paper lists these in a table.
    [Table: adjacency matrix of each meta-path subgraph; image not available]
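
Since the table is not reproduced here, the following sketch only illustrates the general idea, assuming the usual definition in which the matrix C of a meta-path is the product of the relation matrices along that path. The helper name `metapath_adjacency`, the toy matrices `Q` and `R`, and the choice to binarise the product and drop self-loops are assumptions, not details taken from the paper.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)  # rows of bt are the columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def metapath_adjacency(relation_matrices):
    """Multiply relation matrices along a meta-path, then binarise and drop self-loops."""
    c = relation_matrices[0]
    for m in relation_matrices[1:]:
        c = matmul(c, m)
    return [[1 if v > 0 and i != j else 0 for j, v in enumerate(row)]
            for i, row in enumerate(c)]

# Q[c][d] = 1 if client c requested domain d (toy data: 2 clients, 3 domains).
Q = [[1, 1, 0],
     [0, 1, 1]]
# R[d][p] = 1 if domain d maps to IP p (toy data: 3 domains, 2 IPs).
R = [[1, 0],
     [1, 0],
     [0, 1]]

A_dcd = metapath_adjacency([transpose(Q), Q])  # domain-client-domain
A_did = metapath_adjacency([R, transpose(R)])  # domain-IP-domain
print(A_dcd)
print(A_did)
```

Each result is a domain-by-domain matrix, so every subgraph shares the same node set and can be fed to an ordinary graph convolution. A real implementation would typically use sparse matrices and normalise each adjacency matrix before convolution.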

Attention mechanism

Instead of simply averaging the results produced by each subgraph, MAGCN uses an attention mechanism to aggregate them with adaptively estimated weights, giving:
$$H^{(l+1)}=\sigma\left(X \cdot W_{0}+\sum_{i \in |P|} f\left(\alpha_{i}\right) \cdot A_{i}^{\prime} \cdot H^{(l)} \cdot W_{PID_{i}}\right)$$
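
The layer update above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that f is a softmax over the attention coefficients and that σ is ReLU, neither of which is stated in this post, and the function name `magcn_layer` and the toy inputs are hypothetical.

```python
import math

def matmul(a, b):
    bt = [list(col) for col in zip(*b)]
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def softmax(alpha):
    m = max(alpha)
    e = [math.exp(a - m) for a in alpha]
    s = sum(e)
    return [v / s for v in e]

def magcn_layer(X, H, A_list, W0, W_list, alpha):
    """One update: ReLU(X W0 + sum_i softmax(alpha)_i * A_i H W_i)."""
    weights = softmax(alpha)      # assumed form of f(alpha_i)
    out = matmul(X, W0)           # self term on the raw node features
    for w, A, W in zip(weights, A_list, W_list):
        msg = matmul(matmul(A, H), W)   # propagate over one meta-path subgraph
        out = [[o + w * m for o, m in zip(orow, mrow)]
               for orow, mrow in zip(out, msg)]
    return [[max(0.0, v) for v in row] for row in out]  # assumed sigma = ReLU

# Toy example: 2 domain nodes, 1 feature, 2 meta-path subgraphs.
X = [[1.0], [2.0]]
A1 = [[0, 1], [1, 0]]   # the two nodes are linked in the first subgraph
A2 = [[0, 0], [0, 0]]   # no links in the second subgraph
W0, W1, W2 = [[1.0]], [[1.0]], [[1.0]]
H1 = magcn_layer(X, X, [A1, A2], W0, [W1, W2], alpha=[0.0, 0.0])
print(H1)
```

With equal coefficients each subgraph receives weight 0.5, so the layer reduces to the plain average that the attention mechanism is meant to improve on; during training the coefficients would be learned along with the weight matrices.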

The specific algorithm is as follows:
[Figure: MAGCN algorithm listing; image not available]

Experiments

[Figures: experimental results; images not available]

Conclusion

  • This paper proposes a malicious domain detection system HGDom, which naturally models DNS scenarios as HINs from three aspects: name feature distribution, attacker resource aggregation, and client query behavior to achieve richer information fusion.
  • We propose MAGCN and apply it to the HIN to classify domain nodes, taking into account both the features of domains and their associations. MAGCN is a variant of the GCN model with a meta-path-based attention mechanism.
  • We conduct extensive experiments with DNS data collected from TUNET and CERNET2 to demonstrate the accuracy and superiority of HGDom: MAGCN outperforms current state-of-the-art network embedding methods, and HGDom outperforms existing systems based on graph mining.
  • Currently, HGDom contains only three components (client, domain, and IP address). In future work, we plan to add other types of DNS-related data, such as the registration information dataset WHOIS, for a more comprehensive analysis.
  • Furthermore, we also intend to propose more advanced data mining methods in the future to further improve the efficiency and scalability of HGDom.

Origin blog.csdn.net/qq_39328436/article/details/124321104