Introduction to DGA (Domain Generation Algorithm)

Table of Contents

I. Introduction

II. Background

III. Detection

IV. Development

V. Summary


I. Introduction

Malware has become the number one threat to network security, and its design has grown increasingly sophisticated in order to evade detection by security products. One typical technique is to embed a DGA (Domain Generation Algorithm) in the malware to generate rapidly changing domain names. Used as a primary or backup channel for communicating with C2 servers, a DGA makes a botnet more robust and gives the attacker continuous control over infected hosts (bots). Accordingly, DGA research is now a hot topic in the security community. Academia and industry have both produced a large body of work on DGA domain name detection, but in practice these methods generate too many false positives. Meanwhile, traditional DNS transmits data in plain text, which has led to serious leakage of user privacy. The DoT (DNS-over-TLS) and DoH (DNS-over-HTTPS) protocols have since been standardized as RFCs to protect user privacy, but encrypted DNS also brings new challenges for the detection of DGA domain names.

This article first briefly introduces the background of DGA domain names. It then reviews and summarizes the various DGA domain name detection methods, selects one of them for testing on a real product, analyzes the results in depth and gives corresponding suggestions. Next, it briefly introduces the challenges that encrypted DNS brings to DGA detection and methods for detecting encrypted DGA traffic. Finally, it summarizes the current shortcomings and unsolved problems of DGA domain name detection.

II. Background

In recent years, the number and complexity of malware have continued to grow, giving rise to large underground industry chains and cybercrime. According to statistics, the market for cybercrime had reached 150 billion U.S. dollars by 2018 [1]. Whether the goal is sustained economic gain or something else, managing bots is a central problem of botnet control for the attacker. Effective management of bots not only makes it easier to launch various types of attacks, but also delays the discovery of an attack and hides the attacker's real identity. Modern malware commonly uses a DGA to establish communication with the C2 server for these purposes.

Principles of DGA Domain Names

Figure 1 [2] shows how malware uses a DGA to communicate with its C2 server. The client generates a large number of candidate domain names with the DGA and queries them. The attacker runs the same DGA as the malware and therefore obtains the same list of candidate domain names. When an attack is to be launched, the attacker registers a few of these domain names to establish communication, and can additionally apply fast-flux techniques to the registered domain names so that both the domain name and the IP address change rapidly.

Obviously, traditional blacklist-based defenses cannot cope with this approach. On the one hand, blacklists are updated far more slowly than DGA domain names are generated. On the other hand, the defender must block every DGA domain name in order to cut off C2 communication. The use of DGA domain names therefore makes attack easy and defense difficult [3].

Figure 1 Working principle of DGA domain name
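The shared-algorithm idea described above can be illustrated with a minimal sketch. This is a toy generator written for this article, not the DGA of any real malware family; the seed string, the use of MD5 over the date, and the 12-character label length are all assumptions made for illustration.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Derive `count` candidate domains from a shared seed and a date (toy example)."""
    domains = []
    for index in range(count):
        # Both sides hash exactly the same inputs, so they derive the same labels.
        material = f"{seed}|{day.isoformat()}|{index}".encode("ascii")
        digest = hashlib.md5(material).hexdigest()
        domains.append(digest[:12] + tld)
    return domains


# The bot and the attacker run the same function independently.
day = date(2019, 1, 1)
bot_candidates = toy_dga("hypothetical-seed", day)
attacker_candidates = toy_dga("hypothetical-seed", day)

# The attacker registers only one candidate; the bot queries all of them,
# so every other lookup fails (this is where the NXDomain noise comes from).
registered = {attacker_candidates[2]}
reachable = [d for d in bot_candidates if d in registered]
```

Because the defender does not know in advance which candidate will be registered, a blacklist would have to cover the whole candidate list for every day.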

Classification of DGA Domain Names

A DGA consists of two parts: the seed (the algorithm's input) and the generation algorithm itself. DGA domain names can be classified by either. Domain names produced by a DGA are also referred to as AGDs (Algorithmically-Generated Domains).

2.1 Classification by seed

The seed is an input parameter of the DGA that is shared by the attacker and the malware on the client. Different seeds yield different DGA domain names. Generally speaking, seeds can be characterized along two dimensions:

1. Time dependence. The DGA uses time information as input, such as the system time of the infected host or the time taken from an HTTP response.

2. Determinism. The input of mainstream DGAs is deterministic, so their AGDs can be computed in advance. Some DGAs, however, use input that is not deterministic. For example, Bedep [4] uses the foreign exchange reference rates published daily by the European Central Bank as its seed, and Torpig [5] uses Twitter keywords as its seed, so a domain name only takes effect if it is registered within a certain time window.

According to these two seed properties, DGA domain names can be divided into the following four categories:

1. TID (time-independent and deterministic): not related to time, deterministic;

2. TDD (time-dependent and deterministic): related to time, deterministic;

3. TDN (time-dependent and non-deterministic): related to time, non-deterministic;

4. TIN (time-independent and non-deterministic): not related to time, non-deterministic.
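The four labels are simply the combinations of the two seed properties. A minimal sketch of that mapping follows; the function name `seed_category` is made up for this illustration.

```python
def seed_category(time_dependent: bool, deterministic: bool) -> str:
    """Map the two seed properties to the TID / TDD / TDN / TIN label."""
    time_part = "TD" if time_dependent else "TI"
    determinism_part = "D" if deterministic else "N"
    return time_part + determinism_part


# A hard-coded seed with no time input: time-independent, deterministic.
hard_coded = seed_category(time_dependent=False, deterministic=True)

# A seed derived only from the system date: time-dependent, deterministic.
date_based = seed_category(time_dependent=True, deterministic=True)
```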

2.2 Classification by generation algorithm

Existing DGA generation algorithms can generally be divided into the following four categories:

1. Arithmetic-based. This type of algorithm computes a sequence of values that map to ASCII characters, which together form the DGA domain name. It is the most widely used type.

2. Hash-based. The hexadecimal representation of a hash value is used to generate the DGA domain name. Commonly used hash algorithms are MD5 and SHA256.

3. Dictionary-based. Words are selected from a dedicated dictionary and combined, which reduces the character-level randomness of the domain name and makes it more deceptive. The dictionary is either embedded in the malware or fetched from a public service.

4. Permutation-based. The domain name is derived by permuting the characters of an initial domain name.

Because a DGA can combine any seed type with any algorithm type, the resulting DGA domain names are highly diverse in form.
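To make the four algorithm types concrete, the following sketch implements one toy generator per type. None of these corresponds to a real malware family; the constants, word list handling, label lengths, and top-level domains are assumptions chosen only for illustration.

```python
import hashlib
import random


def arithmetic_dga(seed: int, length: int = 10) -> str:
    """Arithmetic-based: compute values and map them to ASCII letters."""
    state = seed
    chars = []
    for _ in range(length):
        # Simple linear congruential step (constants are illustrative).
        state = (state * 1103515245 + 12345) % 2**31
        chars.append(chr(ord("a") + state % 26))
    return "".join(chars) + ".com"


def hash_dga(seed: str, length: int = 16) -> str:
    """Hash-based: use the hexadecimal representation of a hash value."""
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()[:length] + ".com"


def dictionary_dga(seed: int, words: list[str], parts: int = 2) -> str:
    """Dictionary-based: concatenate words picked from a word list."""
    rng = random.Random(seed)
    return "".join(rng.choice(words) for _ in range(parts)) + ".net"


def permutation_dga(seed: int, base_label: str) -> str:
    """Permutation-based: shuffle the characters of an initial label."""
    rng = random.Random(seed)
    chars = list(base_label)
    rng.shuffle(chars)
    return "".join(chars) + ".com"
```

Only the dictionary-based variant produces labels that look like natural words, which is why character-distribution features tend to struggle with it.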

Survival Time of DGA Domain Names

Plohmann et al. [3] reverse-engineered 43 malware families, reimplemented their DGAs, analyzed more than 100 million DGA domain names and, combined with WHOIS information, computed the distribution of domain name survival times for the different DGA families. The per-family details are not listed here; interested readers can consult the original paper.

In summary, the survival time of DGA domain names is generally short: most survive for 1 to 7 days. This places high demands on the defender's real-time detection capability. DGA domain names must be detected as quickly as possible, and corresponding measures taken, to reduce the risk effectively.

III. Detection

Detection work has been ongoing ever since DGA domain names were first exposed, and detection methods differ across scenarios and periods. This section reviews related work, tests one approach on a real product, and describes the problems the algorithm ran into in practice together with suggestions for optimization.

Related Work

According to the detection method used, DGA domain name detection can be roughly divided into two types: text-analysis-based and behavior-analysis-based.

Representative work based on text analysis includes [9][10][11]. [9] analyzes the difference in character distribution between DGA domain names and normal domain names, and classifies domain names in batches grouped by IP. [10] uses an LSTM to learn the difference between DGA domain names and normal domain names, and can decide for each individual domain name whether it is a DGA domain name. Since requests for DGA domain names produce a large number of NXDomains, [11] classifies NXDomains to identify DGA domain names effectively.
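As a rough illustration of what a character-distribution feature looks like, the sketch below computes the Shannon entropy of a domain name's leftmost label. This is a generic feature chosen for illustration, not the specific metric used in [9], [10], or [11], and the function name is made up.

```python
import math
from collections import Counter


def label_entropy(domain: str) -> float:
    """Shannon entropy (bits per character) of the leftmost label."""
    label = domain.split(".")[0].lower()
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    # Sum of p * log2(1/p) over the distinct characters of the label.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


random_looking = label_entropy("lhsjtcl.com")
word_like = label_entropy("google.com")
```

A single feature like this is far from sufficient on its own: short labels limit the achievable entropy, and dictionary-based DGAs are designed to look like ordinary words.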

Representative work based on behavior analysis includes [12][13]. [12] clusters and classifies the NXDomains generated by the same host in order to find infected hosts and, from there, the C2 domain names. [13] recasts detection as a graph inference problem: it builds a graph of the relationships between hosts and domain names from proxy logs, feeds in a small amount of ground truth as seeds, and then uses belief propagation to estimate the marginal probability that a domain name is malicious.
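The first step of a host-centric approach, grouping failed lookups by the host that issued them, can be sketched as follows. This only shows the grouping; it is not the clustering and classification pipeline of [12]. The record layout `(host, domain, rcode)` and the threshold `min_distinct` are assumptions for this example.

```python
from collections import defaultdict


def hosts_with_many_nxdomains(records, min_distinct=3):
    """Group NXDOMAIN lookups by host and keep hosts with many distinct names.

    `records` is an iterable of (host, domain, rcode) tuples (assumed layout).
    """
    nx_by_host = defaultdict(set)
    for host, domain, rcode in records:
        if rcode == "NXDOMAIN":
            nx_by_host[host].add(domain.lower())
    return {
        host: sorted(domains)
        for host, domains in nx_by_host.items()
        if len(domains) >= min_distinct
    }
```

Hosts returned by such a function are only candidates; the NXDomains of each candidate would still need to be clustered or classified before concluding that the host is infected.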

Product Testing

This part consists of two stages: offline model training and online product testing. We use deep learning to extract features automatically, classify NXDomains, and pick out the DGA domain names among them [15].

2.1 Model training

Data set: We collected DGA domain names from DGArchive [14], about 45.7 million domain names covering 62 DGA families. In addition, we collected 15.3 million benign NXDomains. These data served as the raw input for supervised learning.

Preprocessing: We first remove the small amount of noise in the benign NXDomain data set and in DGArchive to obtain a cleaner data set, and then one-hot encode the characters of each domain name as the input to the neural network model.
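One-hot encoding of a domain name can be sketched as below. The alphabet, the fixed length `MAX_LEN`, and the zero-padding scheme are assumptions for illustration; the article does not specify the exact encoding used in the experiment.

```python
import string

# Assumed character set: lowercase letters, digits, and "-", ".", "_".
ALPHABET = string.ascii_lowercase + string.digits + "-._"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 64  # assumed fixed input length


def one_hot_encode(domain: str, max_len: int = MAX_LEN) -> list[list[int]]:
    """Return a max_len x len(ALPHABET) matrix of 0/1 values."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for position, ch in enumerate(domain.lower()[:max_len]):
        index = CHAR_INDEX.get(ch)
        if index is not None:  # characters outside the alphabet stay all-zero
            matrix[position][index] = 1
    return matrix
```

Rows beyond the end of the domain name remain all-zero, which acts as padding so that every input has the same shape.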

Data sampling: Because the classes differ in size, and to keep the classification results unbiased, we fix a threshold and down-sample every class that exceeds it so that each class contains the same number of domain names.
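Threshold-based down-sampling can be sketched as follows. The function name, the `(domain, label)` sample layout, and the fixed random seed are assumptions for this example. Note that classes smaller than the threshold are left as they are, so class sizes only become equal if every class has at least `threshold` samples.

```python
import random
from collections import defaultdict


def downsample(samples, threshold, seed=0):
    """Cap every class at `threshold` samples.

    `samples` is an iterable of (domain, label) pairs (assumed layout).
    """
    by_label = defaultdict(list)
    for domain, label in samples:
        by_label[label].append(domain)

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for label, domains in by_label.items():
        if len(domains) > threshold:
            domains = rng.sample(domains, threshold)
        balanced.extend((domain, label) for domain in domains)
    return balanced
```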

Model selection: We test three neural network models: CNN, LSTM, and BiLSTM. The last layer of the network uses either sigmoid or softmax, for binary and multi-class classification respectively, and the model parameters are tuned.
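The difference between the two output functions can be shown without any deep learning framework. The sketch below is plain Python and only illustrates the output layer, not the CNN, LSTM, or BiLSTM models themselves.

```python
import math


def sigmoid(logit: float) -> float:
    """Binary case: map one logit to P(domain is DGA)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # Equivalent form that avoids overflow for large negative logits.
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def softmax(logits: list[float]) -> list[float]:
    """Multi-class case: map one logit per class to a probability distribution."""
    shift = max(logits)  # subtracting the maximum keeps exp() stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With sigmoid, the network emits a single score per domain name; with softmax, it emits one probability per class (each DGA family plus the benign class).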

Cross-validation: We perform 5-fold cross-validation on the data set.
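For readers unfamiliar with k-fold cross-validation, the index splitting can be sketched in a few lines. This is a generic illustration with an assumed shuffling seed, not the experiment's actual code.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx
```

Each sample is used for testing exactly once and for training in the other four folds; the reported metric is then averaged over the five runs.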

Experimental results: Figure 2 shows the multi-class results of the three neural network models. Because so many DGA families are included, the overall performance of the multi-class experiment is not good. We therefore examine the classification of each DGA family individually. Figure 3 displays 33 randomly selected DGA families plus benign domain names (34 classes). Many DGA families reach almost 100% classification accuracy, especially dictionary-based DGA families such as suppobox, banjori, and volatile.

Figure 2 Multi-class statistical results of different neural network models

Figure 3 Confusion matrix of CNN multi-class (34 classes)

We then merge all DGA family classes of the multi-class setting into a single class and compare the binary classification results with FANCI [11], as shown in Figure 4. The results show that our binary classifier achieves high accuracy and outperforms the existing NXDomain-based DGA domain name detection scheme.

Figure 4 Binary classification results

2.2 Product testing

In the offline experiments, the deep learning models performed about equally well, so we selected the fastest one, the CNN model, for detection on product logs. The detection process is as follows.

We took the product's DNS logs from 2019 and selected the NXDomains among them for detection, 6.69 million domain names in total. They were then passed through the following filters:

Local whitelist. We built a local whitelist; a domain name is filtered out if its second-level domain matches the list. After filtering, 5.45 million domain names remain.

Valid top-level domain. The data contain a large number of DNS requests whose top-level domain does not exist. These are not DGA domain names. After filtering, 4.63 million domain names remain.

Alexa top 10000. Domain names in the Alexa top 10000 are generally considered to belong to large enterprises and, in theory, include no DGA domain names. After filtering, 4.44 million domain names remain.

The remaining 4.44 million domain names were run through the saved CNN model, which flagged 420,000 of them as DGA domain names.
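The three pre-filters can be sketched as a single pass over the NXDomain list. The set names and the helper `second_level_domain` are hypothetical, and taking the last two labels as the registered domain is a deliberate simplification: it mishandles multi-part suffixes such as "com.ar", for which a public suffix list would be needed.

```python
def second_level_domain(domain: str) -> str:
    """Naive registered-domain guess: the last two labels."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def prefilter(domains, whitelist, valid_tlds, popular_domains):
    """Drop domains that cannot be DGA candidates before running the model."""
    kept = []
    for domain in domains:
        name = domain.lower().rstrip(".")
        sld = second_level_domain(name)
        tld = name.rsplit(".", 1)[-1]
        if sld in whitelist:         # filter 1: local whitelist
            continue
        if tld not in valid_tlds:    # filter 2: top-level domain must exist
            continue
        if sld in popular_domains:   # filter 3: popular-site list
            continue
        kept.append(name)
    return kept
```

Only the domain names returned by `prefilter` would then be encoded and passed to the classifier.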

2.3 Result verification

2.3.1 Determine false positives

We analyzed the domain names flagged as highly suspicious and defined the following rules for removing false positives:

1. Remove domain names whose second-level domain is "afftb288.com". The results contain 380,000 domain names under "afftb288.com", such as "41959214.afftb288.com" and "34308479.afftb288.com". Although this domain belongs to a gambling website, we determined that these names do not match the characteristics of DGA domain names.

2. Remove domain names whose main label has a length of 5 or less. To avoid collisions with normal domain names, the DGA domain names used by malware are generally longer than 5 characters, so such domain names are deleted. For example, in "baidu.com" the main label "baidu" has a length of 5 and is therefore filtered out.

3. Remove domain names of legitimate organizations. The results contain a small number of these, such as "ztmbec.com" and "speedy.com.ar".

After these filters, about 27,000 domain names remain to be confirmed.
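The three false-positive rules above can be expressed as a small post-filter. The function and parameter names are hypothetical, and, as before, the last two labels are taken as the second-level domain, which is only an approximation.

```python
def main_label(domain: str) -> str:
    """Label directly left of the top-level domain, e.g. "baidu" in "baidu.com"."""
    labels = domain.lower().rstrip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def drop_false_positives(domains, noisy_slds, known_legitimate, min_label_len=6):
    """Apply the three rules to the domains flagged by the model."""
    kept = []
    for domain in domains:
        name = domain.lower().rstrip(".")
        sld = ".".join(name.split(".")[-2:])
        if sld in noisy_slds:                        # rule 1: e.g. afftb288.com
            continue
        if len(main_label(name)) < min_label_len:    # rule 2: length <= 5
            continue
        if name in known_legitimate or sld in known_legitimate:  # rule 3
            continue
        kept.append(name)
    return kept
```

Rules 1 and 3 depend on manually curated lists, so they have to be maintained as new noisy or legitimate domain names show up in the results.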

2.3.2 Determine DGA domain name

According to public threat intelligence, we found two types of DGA domain names.

The first category contains 2,572 DGA domain names under 206 main domain names, such as:

lhsjtcl.com

dfwpmpm.me

lzxemfc.com

tmhufuf.com

zccaotl.com

orahcre.org

eoerkfc.com

pycbumk.com

The second category is mining-Trojan C2 domain names, such as zeruuoooshfrohlo.su, about 110 in total.

2.3.3 Continuous monitoring

For the remaining 20,000-plus highly suspicious DGA domain names, we report a confidence score and the likely malware family, and follow up with continuous monitoring and correlation analysis against malicious samples.

IV. Development

DGA domain names first came to attention in the form of pseudo-random character strings. Because the character distribution of domain names generated this way differs markedly from that of normal domain names, they are easy to detect. Attackers therefore switched to dictionary-based DGAs, whose character distribution is as close as possible to that of normal domain names and whose character-level randomness is greatly reduced. In recent years, with the specification and deployment of DNS encryption protocols such as DNSCrypt, DoT, and DoH, more and more malware uses encrypted traffic to evade monitoring, and malware using DoH has already been discovered [6].

As encrypted DNS is deployed further, more DGA domain names are expected to be carried over encrypted DNS protocols in the future, and botnets that use DGA domain names will become harder to detect. The article [7] uses the pydig tool to send requests for DGA domain names and normal domain names in order to generate encrypted traffic, and removes irrelevant information from the TLS handshake stage before further analysis. Analyzing packet sizes, the article finds a significant difference between the encrypted packet size distributions of DGA domain names and Alexa domain names, as shown in Figure 5 [7]. Packet size can therefore serve as an important basis for distinguishing DGA domain names in encrypted DNS traffic.

Figure 5 Data packet size distribution of DGA domain name and Alexa domain name transmitted through DoT protocol

However, the article does not discuss in depth factors such as the encryption protocol used, whether padding is applied, or how DoH traffic is identified. These factors have a great influence on packet size, and their impact on the detection results needs further study.

V. Summary

Since DGA domain names were exposed in 2009 [5], they have been evolving for a full 10 years. There is a large body of literature on their properties, detection, defense, and tracking, and many organizations take part in blocking and sinkholing DGA domain names, yet DGA domain names are still widely used as a mainstream means of communicating with C2 servers. The author is not in a position to give the reason for this. Whether the methods in academic papers are not effective enough in real use, or whether deployment and response in industry are inadequate, remains to be explored. In short, there is still a long way to go in the research and governance of DGA domain names.

 

 


Origin blog.csdn.net/whatday/article/details/114690030