Technology | Using Deep Learning to Detect DGA (Domain Name Generation Algorithm)

Abstract: A DGA (Domain Generation Algorithm) is a technique that uses random characters to generate C&C domain names, thereby evading domain blacklist detection. For example, xeogrhxquuubt.com is a DGA-generated domain created by Cryptolocker; if a process on our machine tries to connect to it, the machine may be infected with the Cryptolocker ransomware.

A DGA (Domain Generation Algorithm) is a technique that uses random characters to generate C&C domain names, thereby evading domain blacklist detection. For example, xeogrhxquuubt.com is a DGA-generated domain created by Cryptolocker; if a process on our machine tries to connect to it, the machine may be infected with the Cryptolocker ransomware. Domain blacklists are often used to detect and block connections to such domains, but they are not effective against constantly updated DGAs. Our team has done extensive research on DGAs and has published an article on arXiv about predicting domain generation algorithms with deep learning.

In this article, I will introduce a simple and effective technique for detecting DGA-generated domains. We will use a neural network (or deep learning, as we call it), more specifically a Long Short-Term Memory (LSTM) network, to help us detect DGA-generated domains. First we will discuss the advantages of deep learning, and then I will back up these claims with examples.

If you don't know much about machine learning, I suggest reading the three articles on machine learning I published earlier before reading this one; they will help you follow along.

Benefits of Long Short-Term Memory (LSTM) Networks

Deep learning has arguably dominated the machine learning community in recent years. It is a branch of machine learning based on learning representations of data: instead of handcrafted features, efficient algorithms perform unsupervised or semi-supervised feature learning and hierarchical feature extraction. Although the field has been evolving for decades, deep learning has become especially popular over the past four or five years, helped along by hardware advances (such as improved GPU parallelism) that make training complex networks feasible. An LSTM is a special type of recurrent neural network (RNN) that can learn long-range dependencies in sequences such as text and language. LSTMs are very good at learning patterns over long sequences like text and speech. In this article's examples, I will use them to learn patterns in character sequences (domain names) that help us identify which domains are DGA-generated and which are not.

One of the great benefits of using deep learning is that we can eliminate the tedious process of feature engineering. With conventional methods, we would generate a long list of features (such as length, vowel and consonant counts, and n-gram statistics) and use those features to separate DGA-generated from legitimate domains. Security staff would then have to update and extend the feature set continuously, which is an arduous and painful process; worse, once an attacker learns the filtering rules, they can easily evade our detection by tweaking their DGA. Deep learning's ability to learn representations automatically lets us adapt to a changing adversary faster, and it greatly reduces the human and material investment required.
Another advantage of our technique is that we classify domain names alone, without using any contextual features such as NXDomain responses. Generating contextual features often requires additional, expensive infrastructure (such as network sensors and third-party reputation systems). Surprisingly, an LSTM without any contextual information still performs significantly better than those systems. If you want to learn more about LSTMs, I recommend colah's blog and deeplearning.net.

What is DGA?

First of all, what is a DGA, and why does detecting one matter? Attackers often use domain names to connect malware to the C&C servers that control victims' machines. These domains are typically embedded in the malware itself, and using domains rather than fixed addresses gives attackers a great deal of flexibility, since they can easily repoint a domain to new IPs. A single hard-coded domain name, on the other hand, is rarely relied on, because it is easily detected and blacklisted.

With a DGA, attackers can generate pseudo-random strings to use as domain names, effectively evading blacklist detection. Pseudo-random means that the sequence of strings appears random but can be regenerated and replicated at will, because its structure is predetermined. Such algorithms are commonly used in malware and remote-control software.

Let's take a brief look at what the attacker and the victim each do. First, the attacker runs the algorithm and randomly selects a small number of domains (perhaps just one), then registers a domain and points it at a C2 server. On the victim's side, the malware runs the same DGA and checks whether each output domain exists; if a domain turns out to be registered, the malware uses it as its command-and-control (C2) server. If the current domain is unregistered, the program continues checking other domains.
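To make this concrete, here is a minimal, hypothetical DGA sketch (an illustration only, not any real malware family's algorithm): both sides seed a pseudo-random generator with a shared value, such as the current date, and independently derive the same candidate domains.

```python
import random

def generate_domains(seed, count=5, length=12, tld=".com"):
    # The shared seed makes the "random" sequence reproducible, so the
    # attacker and the malware compute identical candidate domains
    # without ever communicating the list directly.
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return ["".join(rng.choice(alphabet) for _ in range(length)) + tld
            for _ in range(count)]

candidates = generate_domains("2016-11-02")  # e.g. seed on today's date
print(candidates)
```

The attacker registers one of these candidates; the malware walks the same list until a lookup succeeds.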

By collecting samples and reverse-engineering the DGA, security analysts can predict which domains will be generated in the future, pre-register them, and blacklist them. But a DGA can generate thousands of domains in a day, so it is impractical to repeatedly collect samples and update the blacklist every day.

[Figure 1: malware DGA workflow]


Figure 1 shows the workflow of many types of malware. As shown, the malware attempts to connect to three domains: asdfg.com, wedcf.com, and bjgkre.com. The first two domains were not registered and received an NXDomain response from the DNS server. The third domain is already registered, so the malware uses that domain name to establish a connection.

Creating the LSTM

Training Data

Any machine learning model needs training data. Here we will use the Alexa top 1 million websites as our benign raw data. We have also implemented several DGA algorithms in Python, which you can get on our GitHub, and we will use these algorithms to generate the malicious data.
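Assembling the labeled dataset then amounts to pairing each domain with a label (a sketch with toy stand-ins; the real benign set is the Alexa top 1M and the malicious set comes from the DGA implementations on our GitHub):

```python
benign = ["google.com", "youtube.com", "facebook.com"]  # stand-in for Alexa top 1M
dga = ["xeogrhxquuubt.com", "qwldjzmrfkx.com"]          # stand-in for generated DGA domains

# Label benign domains 0 and DGA-generated domains 1.
domains = benign + dga
labels = [0] * len(benign) + [1] * len(dga)
print(len(domains), sum(labels))
```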

Tools and Frameworks

Keras is a Python library that makes writing neural networks much easier. There are many similar tools besides Keras; we prefer Keras here because it is easier to demonstrate and understand. Under the hood, Keras uses Theano or TensorFlow, which are known as its backends. Both are "symbolic" computation libraries, and you can choose between them according to your own preference.

Model Code

The following is the model we built with Python code:


[code screenshot: model definition]
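For readers who cannot see the screenshot, the model can be written roughly as follows (a sketch assuming the Keras Sequential API; the max_features, maxlen, and dropout values are illustrative placeholders, not the exact values we used):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM

max_features = 38  # number of valid characters (illustrative value)
maxlen = 75        # maximum domain string length (illustrative value)

model = Sequential()
# Embed each character id into a vector of 128 floats.
# (Older Keras versions also accepted input_length=maxlen here.)
model.add(Embedding(max_features, 128))
model.add(LSTM(128))          # internal state dimension of 128
model.add(Dropout(0.5))       # guards against overfitting; 0.5 is an assumed rate
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
```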


Let me give a brief explanation of the above code:

In the first line we define a basic sequential neural network model. On the next line we add an embedding layer. This layer converts each character into a vector of 128 floats (128 is not a magic number; it simply works well here). Once this layer is trained (character ids in, 128 floats out), each character is essentially passed through a lookup table. max_features defines the number of valid characters; input_length is the maximum length of the strings we pass to the network.

The next line adds an LSTM layer, which is the crucial step. Here 128 is the dimension of the internal state (it happens to match the size of our embedding layer). A larger dimension gives a more expressive model; 128 is just right for our needs.

The dropout layer is to prevent the model from overfitting. You can delete it if you don't think it's necessary, but it's still recommended that you use it.

The dropout layer is placed before the Dense layer (fully connected layer) of size 1.

We add a sigmoid activation function, which "squashes" continuous real-valued inputs into the range between 0 and 1: a very large negative input produces an output near 0, and a very large positive input produces an output near 1.
Finally, we compile the model with a cross-entropy loss and the RMSProp optimizer. RMSProp is a variant of stochastic gradient descent and tends to work very well for recurrent neural networks.

Preprocessing

Before we can actually start training, we must do some basic preprocessing of the data. Each string is converted to an array of ints, one per character. The encoding is arbitrary, but it should start at 1 (we reserve 0 for the end-of-sequence/padding token) and be contiguous. The following snippet of code achieves this.

[code screenshot: character encoding]
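In case the screenshot is unreadable, the encoding step can be sketched like this (toy domains stand in for the real dataset):

```python
# Build a character -> int mapping; ids start at 1 because 0 is
# reserved for the padding / end-of-sequence token.
domains = ["google.com", "xeogrhxquuubt.com"]  # toy stand-ins for the real dataset
valid_chars = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(domains))))}
max_features = len(valid_chars) + 1  # count of valid characters, plus the reserved 0
encoded = [[valid_chars[ch] for ch in d] for d in domains]
print(encoded[0][:3])
```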

Next, we pad each int array to the same length. Padding allows our toolbox to better optimize computation (in theory, an LSTM does not require padding). Keras provides a very convenient function for this:


[code screenshot: sequence padding]
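The Keras function in question pre-pads with zeros and truncates arrays that are too long; a pure-Python sketch of the equivalent behavior (mirroring Keras' default pre-padding and pre-truncation) looks like this:

```python
def pad_sequences(seqs, maxlen):
    # Pre-pad each sequence with 0 up to maxlen, and keep only the
    # last maxlen elements when a sequence is too long.
    return [[0] * (maxlen - len(s)) + list(s[-maxlen:]) for s in seqs]

padded = pad_sequences([[1, 2], [1, 2, 3, 4]], maxlen=3)
print(padded)  # [[0, 1, 2], [2, 3, 4]]
```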

maxlen gives the length of each padded array; the function pads with 0 and truncates arrays that are too long. This is why it was important that our earlier integer encoding started at 1: the LSTM must be able to learn the difference between padding and real characters.

Here we split the data into training and test sets, and use an ROC curve to evaluate our performance.


[code screenshot: train/test split and ROC evaluation]
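The evaluation step amounts to something like the following (a sketch assuming scikit-learn; the labels and scores here are synthetic stand-ins for the real test labels and the model's predicted probabilities):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Synthetic stand-ins: y_test holds true labels (1 = DGA domain),
# probs holds the classifier's predicted probabilities.
y_test = np.array([0, 0, 0, 1, 1, 1])
probs = np.array([0.1, 0.3, 0.4, 0.35, 0.8, 0.9])

fpr, tpr, _ = roc_curve(y_test, probs)
print("AUC: %.4f" % auc(fpr, tpr))  # 8 of 9 pos/neg pairs ranked correctly -> 0.8889
```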

Comparison

In our arXiv article, we compared our simple LSTM technique to three other techniques. To keep this article accessible, we compare against only one of them: logistic regression over character bigrams. This technique also beats the current state of the art (though it still falls short of the LSTM). It is a more traditional feature-based approach, where the features are a histogram (raw counts) of all character bigrams contained in a domain name. You can find the implementation code of our LSTM for predicting domain generation algorithms on our GitHub.
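The bigram baseline can be sketched as follows (assuming scikit-learn; the four toy domains are illustrative, not the real training set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

domains = ["google.com", "facebook.com", "xeogrhxquuubt.com", "qwldjzmrfkx.com"]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = DGA

# Each feature is the raw count of one character bigram in the domain name.
vec = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vec.fit_transform(domains)
clf = LogisticRegression().fit(X, labels)
preds = clf.predict(vec.transform(["wikipedia.org"]))
print(preds)
```

Unlike the LSTM, this model sees only bigram counts, so an attacker who matches the bigram distribution of benign domains can slip past it.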

Results

Now let's look at the classifier's performance as measured by its ROC curve and AUC:

[Figure: ROC curve of the LSTM classifier]


We can see an AUC of 0.9977, indicating that our classifier performs very well, and we achieved this with only a few lines of code. In a more thorough analysis on a large and diverse dataset, we observed a 90% detection rate at a false-positive rate of 1 in 10,000.

Summary

We have presented a simple technique for detecting DGA domains using neural networks. The technique requires no contextual information (such as NXDomain responses or third-party reputation systems), and its detection performance is far better than some existing techniques. This article summarizes only part of our research on DGAs; you can read our full article on arXiv, and do further experimentation with the code we have published on GitHub.

Source: 36 Big Data

