[Second-Prize Solution] System Access Risk Identification: Problem-Solving Approach of Team "LOL Four Missing One"

The 10th CCF Big Data and Computational Intelligence Contest (2022 CCF BDCI) has concluded. DataFountain (the DF platform), the contest's official competition platform, will gradually publish the winning teams' solutions for each track. This article presents the second-prize solution for the System Access Risk Identification track. The competition page is:

https://www.datafountain.cn/competitions/580

Introduction to the winning team

Team name: LOL Four Missing One

Team members: Four third-year undergraduates (class of 2020) from the School of Software Engineering, South China University of Technology. During their studies the members have won several university and corporate scholarships, taken an active part in laboratory work, and participated in Huawei's Wisdom Plan and Intelligent Base programs, building a solid programming foundation.

Awards: Second Prize

Summary

In current IAM (Identity and Access Management) practice, the most widely used approach is rule-based behavior analysis.

Its limitations are obvious, however. Because it is driven by expert experience, it tends to "block a thousand by mistake rather than let one slip through," and it lacks data-driven evidence of whether someone is trying to steal or verify illegally obtained identity information, or is already using stolen credentials, which is exactly what is needed for early risk warning and handling.

To address this problem, our team built a complete machine-learning pipeline.

Keywords

LightGBM model, feature engineering, misjudgment-rate analysis

Data Analysis

| Variable name | Business meaning | Description |
| --- | --- | --- |
| id | Sample ID | — |
| user_name | User name | If this field is empty, the log was generated before the user logged in to the system |
| department | User's department | — |
| ip_transform | Authentication IP (encrypted) | Real authentication IPs are desensitized via a one-to-one mapping to encrypted strings |
| device_num_transform | Authentication device number (encrypted) | Real device numbers are desensitized via a one-to-one mapping to encrypted strings |
| browser_version | Browser version | — |
| browser | Browser | — |
| os_type | Operating system type | — |
| os_version | Operating system version | — |
| on_datetime | Authentication date and time | — |
| ip_type | IP type | — |
| http_status_code | HTTP status code | — |
| op_city | Authentication city | — |
| log_system_transform | Accessed system (encrypted) | Real system identifiers are desensitized via a one-to-one mapping to encrypted strings |
| url | Accessed URL | — |
| op_month | Authentication month | — |
| is_risk | Whether the access is risky | 1: risky; 0: not risky. Only train.csv contains this field |

Apart from the id field, the dataset contains 16 features. One of them is the label indicating whether the system judged the access as risky; the rest are login information such as user name, IP, network status, and operation time, from which user behavior and network characteristics can be reconstructed. We analyzed the composition and distribution of every field, as well as the relationships between fields, trying to reconstruct the real scenario behind this system.

Because all IP types that can log in are intranet addresses, we infer that this is a company-internal system that is not open to the outside. The company has five departments: accounting, engineering, sales, human resources, and others. Each department can only access the web pages belonging to its own business and cannot access pages of other departments; for example, engineers cannot log in to the accounting site. Working hours are roughly similar across departments but not identical, e.g. engineers work longer hours than other roles. Computers in each department run a mix of operating systems, browsers, and browser versions. The company has branches in 12 major cities across the country, so accesses are relatively concentrated and such records are normal, whereas accesses from abroad or from unknown locations are usually judged as risky. Similarly, login attempts that trigger retrieval of a login code or of the login type are more likely to be flagged as risky by the system.

Feature Engineering

2.1  Feature Disassembly

The operation-time feature in the original dataset carries a lot of information, so we split it into year, month, day, hour, and day of the week, and convert these to the int64 type for convenient processing.
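As a rough illustration (not the team's exact code), the split could look like this in pandas, assuming the training file name and the column names from the data dictionary above; df is reused in the later sketches:

```python
# A minimal sketch of the datetime split; file name and column names follow
# the data dictionary above and are assumptions about the released data.
import pandas as pd

df = pd.read_csv("train.csv")
df["on_datetime"] = pd.to_datetime(df["on_datetime"])

df["year"] = df["on_datetime"].dt.year.astype("int64")
df["month"] = df["on_datetime"].dt.month.astype("int64")
df["day"] = df["on_datetime"].dt.day.astype("int64")
df["hour"] = df["on_datetime"].dt.hour.astype("int64")
df["dayofweek"] = df["on_datetime"].dt.dayofweek.astype("int64")
```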

2.2  Feature extraction

We examined the data types in the dataset and plotted the data distributions, then removed the year feature, whose value is constant, via variance filtering. The month feature was also removed, because its values in the training set do not overlap with those in the test set.
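A small sketch of this filtering step, continuing the df above (the generic constant-column check stands in for the variance filter described in the text):

```python
# Drop columns whose value never changes (zero variance, e.g. year), since
# they carry no signal; then drop month because its values in the training
# set do not overlap with the test set.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
df = df.drop(columns=["month"], errors="ignore")
```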

2.3  Timing feature derivation

Since the users of the system are people, their behavior is strongly correlated with time, and operations at abnormal times are bound to carry some risk. We therefore focus on temporal features. We first introduce the chinese_calendar library to handle holidays, then compute the login intervals of the same user and, for each discrete variable, the mean and standard deviation of the login interval per category. Finally, because the hour is a periodic variable, encoding it as a plain number would seriously mislead the model, so we add the sine and cosine of the hour as features, which works well.
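A sketch of these temporal features, continuing the df above; chinese_calendar exposes is_holiday() for mainland-China statutory holidays, and the grouping by department is just one example of the "per discrete variable" statistics:

```python
import numpy as np
from chinese_calendar import is_holiday

# Statutory-holiday flag from the chinese_calendar library.
df["is_holiday"] = df["on_datetime"].dt.date.map(is_holiday).astype(int)

# Login interval: seconds since the same user's previous record.
df = df.sort_values(["user_name", "on_datetime"]).reset_index(drop=True)
df["login_gap"] = df.groupby("user_name")["on_datetime"].diff().dt.total_seconds()

# Mean / std of the login gap per category of a discrete variable.
df["dep_gap_mean"] = df.groupby("department")["login_gap"].transform("mean")
df["dep_gap_std"] = df.groupby("department")["login_gap"].transform("std")

# Hour is periodic, so place it on the unit circle instead of using a raw int.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```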

2.4  Derivation of Group Statistical Features

Besides temporal characteristics, we can also analyze the access pattern itself. If some accesses share the characteristics of known attack behavior, their risk probability rises. We therefore count the number of visits and downloads by the same user in the previous minute: too many accesses within one minute may indicate a risk of data leakage. In addition, historical access counts at the department level and for the whole system also shape the overall access trend, so we computed these counts and added them to the model.
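A rough sketch of these group statistics, approximating "visits in the previous minute" by counting requests of the same user within the same calendar minute (column names continue the assumptions above):

```python
# Per-user requests within the same calendar minute (a cheap approximation
# of a sliding one-minute window).
df["minute_bucket"] = df["on_datetime"].dt.floor("min")
df["user_visits_per_min"] = (
    df.groupby(["user_name", "minute_bucket"])["on_datetime"].transform("count")
)

# Coarser aggregate counts for the department and the whole accessed system.
df["department_visit_cnt"] = df.groupby("department")["on_datetime"].transform("count")
df["system_visit_cnt"] = df.groupby("log_system_transform")["on_datetime"].transform("count")
```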

2.5  LabelEncoder label encoding

Because many discrete variables in this task take string values, we use LabelEncoder to encode them as integers, which makes them usable by the model and helps its accuracy.
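A minimal sketch of the encoding step; the column list is an assumption, and in practice the encoder should see the union of training- and test-set values so no unseen label appears at prediction time:

```python
from sklearn.preprocessing import LabelEncoder

# Integer-encode the string-valued discrete variables in place.
cat_cols = ["user_name", "department", "ip_transform", "device_num_transform",
            "browser", "browser_version", "os_type", "os_version",
            "ip_type", "op_city", "log_system_transform", "url"]
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```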

Model Construction

3.1  Dataset selection method

We use 5-fold cross-validation to split the training set into five parts with StratifiedKFold stratified sampling, so that each subset keeps the same class ratio as the original dataset and model bias from an uneven sample distribution is avoided. Each subset is used in turn as the validation set while the other four serve as the training set, producing five models. The average validation accuracy of these five models is taken as the performance indicator of the classifier under this K-fold CV, which effectively mitigates overfitting.
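A small check of the stratification property, assuming the df from the sketches above; each validation fold keeps roughly the same positive-class ratio as the full training set:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
for fold, (trn_idx, val_idx) in enumerate(skf.split(df, df["is_risk"])):
    ratio = df["is_risk"].iloc[val_idx].mean()
    print(f"fold {fold}: positive ratio in validation = {ratio:.4f}")
```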

3.2  Model Selection

We tried three models: XGBoost, LightGBM, and CatBoost. Because the training data are structured and relatively small, these gradient-boosting models have a clear advantage. Experiments showed that LightGBM performed best, so it was chosen as the main model for further optimization.
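A sketch of the 5-fold LightGBM training loop, reusing the skf split from section 3.1; the feature columns, hyperparameters, and random seed are illustrative assumptions, not the team's exact settings:

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

feature_cols = [c for c in df.columns if c not in ("id", "is_risk")]
X, y = df[feature_cols], df["is_risk"]

oof = np.zeros(len(df))   # out-of-fold predicted probabilities
models = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
    model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05,
                               num_leaves=63)
    model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
              eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
              callbacks=[lgb.early_stopping(100, verbose=False)])
    oof[val_idx] = model.predict_proba(X.iloc[val_idx])[:, 1]
    models.append(model)

print("out-of-fold AUC:", roc_auc_score(y, oof))
```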

3.3  Parameter adjustment

We first tuned parameters such as num_leaves, lambda_l1, lambda_l2, learning_rate, and max_depth by hand according to the degree of fitting, and then used grid search to locate the best parameter combination for the model.
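A sketch of the grid-search step; the parameter grid below is an assumption for illustration, since the write-up does not give the exact search space (in the sklearn wrapper, reg_alpha and reg_lambda correspond to lambda_l1 and lambda_l2):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "num_leaves": [31, 63, 127],
    "max_depth": [-1, 6, 8],
    "learning_rate": [0.05, 0.1],
    "reg_alpha": [0.0, 0.1, 1.0],   # lambda_l1
    "reg_lambda": [0.0, 0.1, 1.0],  # lambda_l2
}
search = GridSearchCV(
    lgb.LGBMClassifier(n_estimators=500),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```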

3.4  Model Fusion

We tried weighted fusion of multiple models as well as fusion within a single model, and compared the resulting scores.
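A minimal sketch of weighted blending, here between the averaged fold models and the grid-search model from above; the 0.7/0.3 weights are placeholders rather than tuned values, and X_test stands in for the test set encoded with the same features:

```python
# Average the 5 fold models, then blend with a single refit model.
pred_folds = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
pred_single = search.best_estimator_.predict_proba(X_test)[:, 1]
blended = 0.7 * pred_folds + 0.3 * pred_single
```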

3.5  Combining the Model with Feature Engineering: Misjudgment-Rate Correction

When evaluating the prediction quality of a classification model, two important concepts arise: a false positive is a sample predicted as positive that is actually negative, and a false negative is a sample predicted as negative that is actually positive. Clearly, the key to further improving the accuracy of our existing model is to reduce both.

During training, five-fold cross-validation is used to evaluate the model. Every sample in the training set is predicted exactly once, while it sits in the validation fold, so by comparing these predictions with the actual labels we can identify the false positives and false negatives.

Figure 1: Judgment demonstration diagram of misjudgment samples
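A sketch of how the misjudged out-of-fold samples could be flagged, using the oof probabilities from the CV loop above and a 0.5 threshold (the threshold is an assumption):

```python
# Flag out-of-fold false positives and false negatives on the training set.
oof_pred = (oof >= 0.5).astype(int)
df["false_positive"] = ((oof_pred == 1) & (y == 0)).astype(int)
df["false_negative"] = ((oof_pred == 0) & (y == 1)).astype(int)
df["misjudged"] = df["false_positive"] | df["false_negative"]
```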

The following analyzes the misjudgment rate by hour:

Figure 2: Sample misjudgment rate corresponding to different hour categories

The figure shows the misjudgment-rate curve over the 24 hour-of-day categories in the training set. The rate is lowest at hour 1, only 1.04%, and highest at hour 12, at 13.18%, a gap of 12.14 percentage points.

In the real scenario, this difference probably arises because the company has more unusual events during the busy periods around hours 8 and 12, which easily lead to misjudgments by the model. Introducing the misjudgment-rate correction gives the model extra feedback about the risk of access records in specific time periods.

Figure 3: Demonstration of the actual model training operation

In practice, we first train the model once, use it to predict the access-risk probability of the training-set samples, group the predictions by hour to obtain the misjudgment rate of each category, and then feed these per-category misjudgment rates back in as new features for a second round of training, which produces the final test-set predictions.
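A sketch of this two-stage correction, continuing the sketches above; test_df is an assumed name for the test set processed with the same feature pipeline:

```python
# Per-hour misjudgment rate from the first-stage out-of-fold results.
hour_err = df.groupby("hour")["misjudged"].mean()

# Attach it to train and test as a new feature.
df["hour_misjudge_rate"] = df["hour"].map(hour_err)
test_df["hour_misjudge_rate"] = test_df["hour"].map(hour_err)

# The second training round reuses the same CV loop, now with the extra
# feature and without the helper flag columns.
feature_cols_2 = [c for c in df.columns
                  if c not in ("id", "is_risk",
                               "false_positive", "false_negative", "misjudged")]
```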

Application Results

Figure 4: Score curves during model iterations

After adding multiple features, including visits per minute and statutory holidays, our leaderboard-A score rose from 0.928607 at baseline to 0.937540. The misjudgment-rate correction then greatly improved prediction accuracy on the test set: the A-list score rose from 0.940143 to 0.945732 and our ranking climbed by more than 30 places. When the B-list results were released, we finished second.

Acknowledgements

We thank the organizers for providing this platform, where we could meet better versions of ourselves. We also thank Prof. Cai Yi and senior student Xie Jiayuan for their guidance; we deeply admire their knowledge and are very grateful for their careful mentoring. Finally, we thank ourselves for the persistent effort. The road ahead is long; we will keep searching high and low.

