Artificial intelligence: application and improvement in the fields of big data and network security

Detecting encrypted traffic is a long-standing topic in network security. The development of artificial intelligence has brought new approaches to this old problem.

In recent years, more and more researchers have tried to apply AI to existing network security problems. From the unreliable claims of earlier years (see the article "Artificial intelligence cybersecurity? Please be more serious!") to work that now reports real experimental results, the science is clearly progressing, and AI is being applied to network security more and more.

One solid example is "Research on Fine-grained Classification of IoT Malware Families Based on Deep Learning", published by NSFOCUS Tianshu Lab. The author verifies his ideas with concrete experiments, so we can see that he is really doing something with AI, and his approach to the problem is well worth learning from. Of course, from the perspective of both the business domain and professional AI algorithms, there are still shortcomings and room for improvement. Today we respond to the article from afar and discuss parts of its method in detail.

1 The article's main idea for identifying Trojans

In one sentence, the main idea of "Research on Fine-grained Classification of IoT Malware Families Based on Deep Learning" is: split the pcap captures of Trojan communication into pieces, then train an AI image classification model on them. The principle is to use the classification ability of AI to find clues in the Trojan's communication data.

This line of thinking is creative, and starting from traffic measurement is one of the main ways to address network security problems. Commendably, the author also ran experiments to verify the feasibility of the idea, as shown in the figure:
[Figure: classification results of the experiment]

The results show that this approach has some recognition ability, but they also expose problems, such as the classification problems with categories 4, 8, 10, and 11.

Faced with such a result, we could not help wanting to analyze it in depth. We think the classification model is fine and that the problem most likely lies in the data processing step at the business level.

2 Improvement from the business perspective

The article describes its business data processing as follows: "First investigate the currently popular IoT malware families, then query and download samples on the internal platform, 12 malicious families in total, and finally obtain the corresponding pcap captures, and use the USTC-TK2016 toolset to preprocess the data."

To understand the problem, you need to understand what the USTC-TK2016 toolset does.

2.1 What does the USTC-TK2016 toolset do?

As we understand it, the USTC-TK2016 toolset takes the first bytes of every session up to a fixed length, discards the excess, and pads with zeros when a session is too short. This is actually very similar to how DPI (deep packet inspection) processes data: a DPI system focuses on analyzing the first few bytes of each flow. Of course, the author may not have done exactly this in practice. The article does not give the specific procedure, so we analyze this method as such; the discussion is purely technical and is not directed at the author.

What happens if you do this? Let's look below.

2.2 Problems with data collection

In essence, the method above treats a single session as the object to recognize and trains on that. That is, it takes 784 bytes from a session according to certain rules, converts them into a 28*28 image, and then classifies the image.
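The truncate, pad, and reshape steps can be sketched as follows. This is a minimal illustration of the idea as described in this post, not the actual USTC-TK2016 code; the function name `session_to_image` is hypothetical, and taking the first 784 bytes is an assumption about the "certain rules".

```python
IMAGE_SIDE = 28
SESSION_BYTES = IMAGE_SIDE * IMAGE_SIDE  # 784 bytes per session


def session_to_image(payload: bytes) -> list[list[int]]:
    """Turn one session's raw bytes into a 28x28 grid of pixel values (0-255)."""
    # Keep only the first 784 bytes; pad with zero bytes if the session is shorter.
    fixed = payload[:SESSION_BYTES].ljust(SESSION_BYTES, b"\x00")
    # Cut the flat byte string into 28 rows of 28 values each.
    return [
        list(fixed[row * IMAGE_SIDE:(row + 1) * IMAGE_SIDE])
        for row in range(IMAGE_SIDE)
    ]


short_image = session_to_image(b"\x16\x03\x01")   # short session, zero-padded
long_image = session_to_image(bytes([255]) * 1000)  # long session, truncated
```

Each resulting grid can be fed to an image classifier as a single-channel image. Note that everything after byte 784 of a session is simply lost.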

From the perspective of large-scale data analysis, it is not advisable to rely on a single session's row of data as input, because each app can produce many different sessions and, in some cases, different apps can produce similar session data.

Another problem is that when these Trojans run in a so-called sandbox, the traffic generated is not necessarily all Trojan traffic. Some of it may be background traffic: the host operating system, whether Windows 10 or some Linux distribution, generates traffic of its own. The article does not mention any noise cleaning. If no denoising is performed when processing the data, recognition performance will suffer.
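One simple way to denoise, sketched below, is to record which remote endpoints the sandbox contacts when no sample is running and drop those sessions from the sample captures. This is our own assumption about how such cleaning could be done, not something the article describes; the `Session` type, the function name, and all addresses (taken from ranges reserved for documentation) are hypothetical.

```python
from typing import NamedTuple


class Session(NamedTuple):
    remote_ip: str
    remote_port: int
    payload: bytes


def remove_background(sessions, baseline_endpoints):
    """Drop sessions whose remote endpoint also appears in a clean sandbox run."""
    return [
        s for s in sessions
        if (s.remote_ip, s.remote_port) not in baseline_endpoints
    ]


# Endpoints seen while the sandbox ran WITHOUT any sample (e.g. OS update checks).
baseline = {("192.0.2.50", 443), ("192.0.2.51", 123)}

captured = [
    Session("192.0.2.50", 443, b"os update check"),
    Session("203.0.113.10", 6667, b"trojan beacon"),
    Session("192.0.2.51", 123, b"time sync"),
]

cleaned = remove_background(captured, baseline)
```

A real pipeline would probably need more than endpoint matching (for example, DNS and broadcast traffic), but even this coarse filter keeps operating system chatter out of the training set.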

2.3 The key information was cleaned out

In the data cleaning stage, the author randomly replaced traffic-specific information such as IP addresses. This cost the model a large number of clues for the recognition task. In a Trojan's communication traffic, the remote IP and port form a largely fixed set, which is the most useful clue for identifying the Trojan.

Generally speaking, Trojans communicate with C2 (command and control) servers. There are not many such servers in the world, and they are fairly fixed, so matching directly on a few addresses plus a few fixed ports can solve the problem immediately.

Of course, used directly in this way, this is a deterministic model, and you don't need artificial intelligence at all! In practice, that is indeed how it is done.
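Such a deterministic model is little more than a table lookup. The sketch below assumes a hand-maintained table of known C2 endpoints; the table contents, family names, and function name are made up for illustration, and the IP addresses come from ranges reserved for documentation.

```python
from typing import Optional

# Hypothetical table of known C2 endpoints: (remote IP, remote port) -> family.
KNOWN_C2_ENDPOINTS = {
    ("203.0.113.10", 6667): "family_a",
    ("203.0.113.10", 8080): "family_a",
    ("198.51.100.7", 443): "family_b",
}


def identify_by_endpoint(remote_ip: str, remote_port: int) -> Optional[str]:
    """Return the malware family for a known C2 endpoint, or None if unknown."""
    return KNOWN_C2_ENDPOINTS.get((remote_ip, remote_port))


hit = identify_by_endpoint("203.0.113.10", 6667)   # known endpoint
miss = identify_by_endpoint("192.0.2.1", 80)       # not in the table
```

The lookup needs no training and no payload inspection, so it works the same whether or not the traffic is encrypted. Its weakness is the one AI is meant to address: an endpoint that is not yet in the table is never recognized.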

Removing the most useful information yourself and then classifying by subtle features feels like creating a problem for yourself. It goes against the essence of using AI: AI should be used to meet actual needs, not for the sake of using AI.

2.4 AI should be used in suitable scenarios

From the perspective of AI application scenarios, the article's approach is a feasible solution but not the best one. With current technology, DPI already handles application recognition at this level very well.

The claim at the beginning of the article, that traditional DPI systems cannot recognize encrypted data, is somewhat problematic, because in the scenario the author describes, 80% or 90% of the traffic involved is identified with this deterministic DPI model. Using artificial intelligence to recognize again what has already been recognized is a bit unnecessary. From an academic point of view, however, the author's experiments are still valuable: at the very least they demonstrate the feasibility of using AI for traffic-based application identification.

The article intends its method as a rebuttal of DPI, but it does not actually show why the method can solve encrypted data identification, since encrypted data can produce all sorts of images.

For the specific scenario in the article, though, the problem can in essence be described completely with a polynomial algorithm directly, so artificial intelligence is not really called for.

2.5 How to use AI in the security field?

As far as this article's scenario is concerned, it is hard to get good results by treating the data from a given source address as the recognition target, or by identifying each session from a source address individually.

If you want to do this, change the approach and use a given remote address as the recognition target instead. That works better because it captures the intent of each individual IP. The unit monitored should also not be a single session: the verdict should be based on the overall behavior that emerges after merging multiple sessions.

Converting data into images for classification is an excellent idea, because images are inherently 2-dimensional data. To make full use of this, each session should be used as a row during conversion and the multiple sequences as columns, so that the image preserves as much of the event's own information as possible, instead of simply reshaping the 1-dimensional features of a single record into 2 dimensions.
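A rough sketch of this layout is shown below: sessions are grouped by remote IP, and each session contributes one row to that IP's image. The function name and the sizes are hypothetical, and using the leading bytes of each session as the columns is our assumption about one reasonable choice of per-session sequence.

```python
from collections import defaultdict


def build_ip_images(sessions, rows=28, cols=28):
    """Build one rows x cols image per remote IP, one session per row.

    `sessions` is an iterable of (remote_ip, payload_bytes) pairs.
    """
    by_ip = defaultdict(list)
    for remote_ip, payload in sessions:
        by_ip[remote_ip].append(payload)

    images = {}
    for remote_ip, payloads in by_ip.items():
        # One row per session: first `cols` bytes, zero-padded if shorter.
        matrix = [list(p[:cols].ljust(cols, b"\x00")) for p in payloads[:rows]]
        # Pad with empty rows when the IP has fewer than `rows` sessions.
        matrix += [[0] * cols for _ in range(rows - len(matrix))]
        images[remote_ip] = matrix
    return images


observed = [
    ("203.0.113.10", b"\x01\x02"),
    ("203.0.113.10", b"\x03"),
    ("198.51.100.7", b"\x09"),
]
images = build_ip_images(observed, rows=3, cols=4)
```

Here the second dimension of the image carries real information (different sessions of the same remote address) rather than being an artifact of reshaping one flat byte string.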

2.6 Better AI technical support

As an AI practitioner, I am very happy to see people in the security field doing down-to-earth experiments and research with AI. The development and adoption of AI technology depends on the efforts of colleagues of our generation.

The above are just some suggestions on the article's technical methods, based on our past experience. Since we do not come from a security background, we welcome corrections wherever we have fallen short.

Below we share some of our experience on the AI side, in the hope that it helps colleagues in the industry.

1. For recognition problems with a very large number of categories, especially app recognition, classification algorithms are not recommended. There are too many classes, and since most classifiers rely on supervised learning, it is hard to train a model for classes whose samples it has never seen. App version updates are a particular problem: a new version changes a batch of features, which the model can then no longer locate and capture correctly.

2. For this kind of many-category recognition problem, we recommend representation learning: learn an embedding space and do similarity matching by distance in that space. This copes with various unknown cases and allows a smooth transition to unsupervised training.

3. Although the classification model used in the article is very mature, it is too old, and its input format is rarely used today. We recommend current mainstream classification models such as ResNet and NASNet. At the time of writing, the state-of-the-art classification model is EfficientNet (see "Technical Interpretation of EfficientNet Series Models"); using it can greatly improve classification results.

4. Some excellent unsupervised models are worth consulting, such as autoencoders, variational autoencoders, mutual information maximization, and f-GAN. These techniques can provide a lot of inspiration for data research in network security.

5. Classifying the unknown, that is, identifying and handling unknown threats, is also a direction of our current research. This rather challenging task interests us greatly. At present we mainly use graph neural networks and zero-shot learning. Some of the results and key points have been written into a book that will be published in the future. We also hope to exchange ideas with, and learn from, colleagues who have worked in this area.
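
The similarity matching recommended in point 2 can be sketched as follows. The sketch assumes some encoder has already mapped each traffic sample to an embedding vector; the encoder itself is not shown, and the toy vectors, family names, threshold, and function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_by_similarity(query, references, threshold=0.8):
    """Return (label, score) of the closest reference, or ("unknown", score)."""
    best_label, best_score = None, -1.0
    for label, reference in references.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "unknown", best_score
    return best_label, best_score


# Toy reference embeddings, one per known family.
references = {"family_a": [1.0, 0.0], "family_b": [0.0, 1.0]}

known = match_by_similarity([0.9, 0.1], references)
unseen = match_by_similarity([1.0, 1.0], references)
```

Unlike a fixed-output classifier, supporting a new family only requires adding a reference embedding, and samples that are far from every reference are reported as unknown instead of being forced into an existing class.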

Origin blog.csdn.net/weixin_43672348/article/details/106213972