[Second Prize Solution] Problem-solving approach of team "Dao Ke Dao, Fei Chang Dao" for the artificial intelligence-based vulnerability data classification competition

The winning solution of the second prize team "Dao Ke Dao, Fei Chang Dao" in the 2022 CCF BDCI Contest · Digital Security Open Competition "Artificial Intelligence-Based Vulnerability Data Classification". Competition page: http://go.datafountain.cn/s57

Team Profile

The team has rich competition and project experience. Its members have won top placings in many AI competitions, including first place in the Aliyun Tianchi security malicious program detection competition, third place in the iFLYTEK Malware Classification Challenge, fourth place in CCF's artificial intelligence-based malware family classification competition, fourth place in the iFLYTEK Event Extraction Challenge, fourth place in the iFLYTEK Alzheimer's Syndrome Prediction Challenge, and fifth place in the Datacon Big Data Security Analysis Competition. Team members hold more than ten invention patents in total and have deep insight into both traditional machine learning and deep learning.

Abstract

As critical information infrastructure accumulates large numbers of information assets during its digital, networked, and intelligent transformation, its network systems grow ever more complex, and the threat posed by vulnerabilities, their inevitable "companions", becomes increasingly prominent. To cope with these increasingly severe security challenges, building a security vulnerability knowledge base is essential. The vulnerability data on the CVE platform is authoritative, internationally disclosed vulnerability knowledge covering multiple dimensions and diverse kinds of information. Extracting information from this data is necessary for better understanding and continued research.

For information extraction, the traditional approach relies on hand-crafted rules, which is inefficient to develop and generalizes poorly. Machine learning-based natural language processing (NLP) methods can summarize and learn from massive data, greatly improving the generalization ability of information extraction.

Although pre-trained models have made great progress in many fields, especially in natural language processing, we classify the vulnerability data with feature engineering and traditional machine learning models, considering the limited computing resources of some real industrial scenarios and the interpretability of machine learning models.

Information extraction from vulnerability data faces problems such as uneven class distribution and noisy labels. This paper proposes a solution that first corrects the noisy data, then extracts features such as key verb phrases and noun phrases from the text, and finally applies models of different complexity according to the difficulty of each task, achieving strong information extraction results.

Keywords

Vulnerability information extraction, noisy data, feature engineering, efficiency

Introduction

In the field of network security, vulnerabilities are often regarded as a "killer" weapon by attackers and as the "root of all evil" by defenders. Although a vulnerability causes no harm by itself, once exploited it can pose a serious threat. As critical information infrastructure accumulates large numbers of information assets during its digital, networked, and intelligent transformation, its network systems grow ever more complex, and the threat posed by vulnerabilities as inevitable "companions" becomes increasingly prominent.

To cope with these increasingly severe security challenges, building a security vulnerability knowledge base is essential. The vulnerability data on the CVE platform is authoritative, internationally disclosed vulnerability knowledge. Each record includes a CVE number, a vulnerability score, and a vulnerability description; the description covers the conditions for exploiting the vulnerability, its scope of impact, and the effect (harm) it can achieve. Extracting information from this data is necessary for better understanding and continued research. The traditional extraction approach relies on hand-crafted rules, which is inefficient to develop and generalizes poorly, whereas machine learning-based natural language processing (NLP) methods can summarize and learn from massive data, greatly improving the generalization ability of information extraction.

Information extraction from vulnerability data faces problems such as uneven class distribution and noisy labels. This paper proposes a solution that first corrects the noisy data, then extracts features such as key verb phrases and noun phrases from the text, and applies models of different complexity (logistic regression, random forest, XGBoost) according to the difficulty of each task, finally achieving first place on leaderboard A and second place on leaderboard B.

Overall Scheme Design

This paper uses NLP methods to mine the vulnerability description and extract important information such as the privilege the attacker requires (Privilege-Required), the attack vector (Attack-Vector), and the result of exploiting the vulnerability (Impact). The solution consists of five modules: data analysis, data preprocessing, feature extraction, model training, and model prediction. The overall flow is shown in Figure 1 below:

2.1 Data Analysis Module

This vulnerability data classification task requires classifying three attributes at the same time: the Attack-Vector attribute is a binary classification task, the Privilege-Required attribute is a four-class classification task, and the Impact attribute is a hierarchical multi-class classification task. There are 4,499 training samples, 1,794 test samples on leaderboard A, 2,686 test samples on leaderboard B, and 60,000 additional unlabeled samples. Analysis of the data reveals three major difficulties:

(1) The training set is highly imbalanced (a quick distribution check is sketched after this list). In the Attack-Vector attribute, the remote class has 4,279 samples and the non-remote class only 220; in the Privilege-Required attribute, the access class has 2,685 samples, the nonprivileged class 945, the unknown class 799, and the admin/root class only 70. The Impact attribute has a hierarchical structure; ignoring the hierarchy and counting each specific category directly, the largest category, Privileged-Gained(RCE)_unknown, has 1,272 samples, while the smallest, information-disclosure_other-target(credit)_admin/root, has only 3 samples.

(2) The distributions of the training set and the test set are inconsistent. After fine-tuning a pre-trained model on the training set, we observed a large gap between validation-set and test-set performance. Data screening and analysis revealed noisy samples in the training set.

(3) The amount of labeled training data is small, but a large amount of unlabeled data is available. Making better use of the unlabeled data is the key to improving performance.
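
As a rough illustration of the distribution analysis in point (1), the sketch below counts each attribute's label distribution with pandas. The file name and column names are assumptions, not the competition's actual schema.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the real competition schema.
train = pd.read_csv("train.csv")

for col in ["Attack-Vector", "Privilege-Required", "Impact"]:
    counts = train[col].value_counts()
    print(f"--- {col} ---")
    print(counts)
    # The ratio between the largest and smallest class exposes the imbalance
    print("imbalance ratio: %.1f" % (counts.iloc[0] / counts.iloc[-1]))
```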

2.2 Data Preprocessing Module

We first remove text irrelevant to the task, using regular expressions to delete special punctuation marks (such as single quotes, double quotes, and exclamation marks), software version numbers (such as 17.1r3, 4.2.x), time information (such as 11:38:17, jul 23 14:16:03), unimportant notes (such as "note: this issue is due to an incorrect fix for cve-2012-5643"), affected version ranges (such as "this issue affects juniper networks junos os on acx500 series, acx4000 series: 17.4 versions prior to 17.4r3-s2."), and fixed-version statements (such as "fixed in vault and vault enterprise 1.7.6, 1.8.5, and 1.9.0.").
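
The team's exact cleanup rules were not published, but the idea can be sketched with regular expressions in Python; the patterns below are illustrative stand-ins for the categories listed above.

```python
import re

# Illustrative patterns only; the actual rules used by the team are unknown.
PATTERNS = [
    r"['\"!]",                                  # special punctuation marks
    r"\b\d+(?:\.\d+|\.x|r\d+(?:-s\d+)?)+\b",    # versions such as 17.1r3, 4.2.x
    r"\b\d{2}:\d{2}:\d{2}\b",                   # times such as 11:38:17
    r"note:[^.]*\.",                            # unimportant notes
    r"this issue affects[^.]*\.",               # affected version ranges
    r"fixed in[^.]*\.",                         # fixed-version statements
]

def clean_description(text: str) -> str:
    """Strip task-irrelevant content from a vulnerability description."""
    text = text.lower()
    for pattern in PATTERNS:
        text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_description(
    "Fixed in Vault and Vault Enterprise 1.7.6, 1.8.5, and 1.9.0. "
    "A remote attacker may execute arbitrary code."
))  # -> "a remote attacker may execute arbitrary code."
```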

The noisy labels in the training set are then corrected. First, a small proportion of the data is randomly sampled and corrected by manual verification based on our understanding of the task; these corrected samples serve as seeds. For example, when an Impact description involves two or more categories, the data may be labeled with a lower-priority category, whereas the correct label is the highest-priority category among them. A classifier is trained on the seed samples (labeled 1) against the remaining sampled data (labeled 0), and then predicts over the unsampled data to find samples similar to the seeds. Because a single sampling carries some uncertainty, this is repeated three times, and only samples predicted as 1 in all three rounds are flagged. The mislabeled samples among them are then corrected with expert experience, completing the noise correction.
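
The correction loop can be sketched as follows, assuming a TF-IDF text representation and a logistic regression classifier (the write-up does not specify the model used for this step); variable names such as `labeled_texts` and `is_seed` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def flag_seed_like(labeled_texts, is_seed, pool_texts, n_rounds=3):
    """Vote over repeated rounds; only unanimous hits go to expert review.

    labeled_texts/is_seed: the manually reviewed sample (1 = corrected seed,
    0 = the rest); pool_texts: the unsampled remainder of the training set.
    """
    labeled_texts = np.asarray(labeled_texts, dtype=object)
    is_seed = np.asarray(is_seed)
    votes = np.zeros(len(pool_texts), dtype=int)
    for round_id in range(n_rounds):
        # Re-draw a subset each round to reflect the sampling uncertainty
        rng = np.random.default_rng(round_id)
        idx = rng.choice(len(labeled_texts), size=int(0.8 * len(labeled_texts)),
                         replace=False)
        vec = TfidfVectorizer(max_features=20000)
        X = vec.fit_transform(labeled_texts[idx])
        clf = LogisticRegression(max_iter=1000).fit(X, is_seed[idx])
        votes += clf.predict(vec.transform(pool_texts))
    # Samples predicted as seed-like in every round are correction candidates
    return np.where(votes == n_rounds)[0]
```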

2.3 Feature Extraction Module

The feature extraction module performs further feature extraction on the preprocessed data. It extracts simple statistical features, such as the total number of characters, the total number of words, and the number of sentences. Moreover, analysis of the labeled examples provided by the organizers (the parts highlighted in red) shows that the important information lies in noun phrases, verb phrases, and certain keywords. The spaCy library can extract these structural phrases using a pipeline pre-trained on massive data, ensuring the effectiveness and completeness of the extracted information.
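
A minimal sketch of such extraction with spaCy is shown below, assuming the small English pipeline (`python -m spacy download en_core_web_sm`); the verb-phrase rule is a simple proxy, not necessarily the team's exact matcher.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline with parser

def extract_features(text: str) -> dict:
    doc = nlp(text)
    return {
        # Simple statistical features
        "n_chars": len(text),
        "n_words": sum(1 for t in doc if not t.is_punct),
        "n_sents": len(list(doc.sents)),
        # Structural phrase features
        "noun_phrases": [chunk.text for chunk in doc.noun_chunks],
        # Verb-phrase proxy: a verb together with its direct object, if any
        "verb_phrases": [
            f"{t.lemma_} {c.text}"
            for t in doc if t.pos_ == "VERB"
            for c in t.children if c.dep_ == "dobj"
        ],
    }

print(extract_features(
    "A remote attacker can execute arbitrary code via a crafted request."
))
```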

2.4 Model Training Module

Since the three attributes Attack-Vector, Privilege-Required, and Impact differ in classification difficulty, models of different complexity (logistic regression, random forest, XGBoost [1]) are used according to the difficulty of each task.
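
The mapping from task to model might look like the sketch below; the hyperparameters are assumptions, not the team's published settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    # Binary Attack-Vector task: a simple linear model suffices
    "Attack-Vector": LogisticRegression(max_iter=1000),
    # Four-class Privilege-Required task: a moderately complex ensemble
    "Privilege-Required": RandomForestClassifier(n_estimators=300),
    # Hierarchical Impact task: the most expressive boosted-tree model
    "Impact": XGBClassifier(n_estimators=500, learning_rate=0.05),
}

# Hypothetical training loop, assuming per-task feature matrices and labels:
# for task, model in models.items():
#     model.fit(X_train[task], y_train[task])
```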

To alleviate class imbalance, we oversample minority-class samples, increase the weights of minority classes, and undersample majority-class samples, which strengthens the model's generalization to the minority classes.
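
Minority-class oversampling, one of the rebalancing methods mentioned above, can be sketched as follows for dense feature matrices (class weighting and majority undersampling are analogous):

```python
import numpy as np
from sklearn.utils import resample

def oversample_minority(X, y, random_state=0):
    """Upsample every class to the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls in classes:
        mask = y == cls
        X_cls, y_cls = resample(X[mask], y[mask], replace=True,
                                n_samples=target, random_state=random_state)
        X_parts.append(X_cls)
        y_parts.append(y_cls)
    return np.vstack(X_parts), np.concatenate(y_parts)
```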

Since there are 60,000 additional unlabeled samples, far more than the training set, semi-supervised learning can increase the richness and diversity of the training data. Concretely, a supervised model predicts the unlabeled pool, high-confidence predictions are taken as pseudo-labels and added to the training set, and the cycle is repeated several times to obtain a new training set.
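
A sketch of this self-training loop, assuming dense feature matrices and a scikit-learn-style classifier (the confidence threshold and round count are assumptions):

```python
import numpy as np

def self_train(model, X_train, y_train, X_unlabeled, threshold=0.95, n_rounds=5):
    """Iteratively pseudo-label high-confidence unlabeled samples."""
    X_pool = X_unlabeled
    for _ in range(n_rounds):
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Map the most probable column back to its class label
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        # Move confident samples from the pool into the training set
        X_train = np.vstack([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, pseudo])
        X_pool = X_pool[~confident]
    return model, X_train, y_train
```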

2.5 Model Prediction Module

The prediction module extracts features from the test data with the feature extraction module described above and predicts with the parameters obtained from the training module. The final result ranked first on leaderboard A and second on leaderboard B.

Acknowledgments

We are very grateful to the organizing committee of the China Computer Federation Big Data & Computing Intelligence Contest (CCF BDCI) for its meticulous preparation and organization of the Artificial Intelligence-Based Vulnerability Data Classification competition. Through careful analysis and deep reflection on the problem, and after many rounds of model iteration and validation, we finally arrived at an innovative solution.

We are also very grateful to our families, colleagues, and friends for their strong support and selfless help.

References

[1] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 785-794.

