[Second Prize Solution] System Access Risk Identification: "QDU" Team Solution Approach

The 10th CCF Big Data and Computational Intelligence Competition (2022 CCF BDCI) has successfully concluded. DataFountain, the competition's official platform (abbreviated as the DF platform; its official name is DataFountain or DataFountain Data Science), is gradually publishing the solution write-ups of the winning teams for each contest.

This article presents the second-prize solution for the [System Access Risk Identification] contest.

Competition address: https://www.datafountain.cn/competitions/580

Introduction to the winning team

Team name: QDU

Team members: QDU is a team of teachers and students from Qingdao University who enjoy data mining competitions. The team has achieved excellent results, including CIKM AnalytiCup 2019 (champion), WSDM CUP 2022 (champion), KDD CUP 2020 (1st in the preliminary round, 6th in the final), 8th place in the KDD CUP 2022 PaddlePaddle track, the Ctrip Yunhai big data competition (champion), and the 2nd Rong360 Tianji Risk Control Competition (runner-up). Zhao Yingying, a third-year undergraduate student in the Department of Computer Science and Technology, Li Zhiruo, a third-year graduate student in the Department of Statistics, and Wu Shunyao, a teacher in the Department of Computer Science and Technology, took part in this contest.

Awards: Second Prize

Abstract

As countries and enterprises pay increasing attention to system security, system access risk identification has become more and more important in unified identity management systems. This contest requires a systematic risk assessment pipeline. It poses several difficulties, such as the small number of samples and the inconsistent distributions of the offline and online data sets, so building the assessment pipeline from the given data is a real challenge. This paper therefore proposes a system risk identification scheme based on user behavior pattern mining. First, user behavior patterns are analyzed and explored, and relevant features are extracted. Next, the data from the first three months, whose distribution is similar to that of the online test set, is selected as the training set, and LightGBM is used for training. Finally, the predictions are post-processed: the labels are corrected according to the rule that nighttime access carries a high risk, which gives the final result. The predictions produced by this pipeline ranked third in the final round with a score of 0.94862310.

Keywords

System risk identification, user behavior pattern mining, unified identity management system

Introduction

A unified identity management system lets users access the various information systems and IT services of an enterprise through a single account, which helps improve management efficiency [1]. However, as network security technology evolves and new attack methods multiply, unified identity management systems face the problem that unknown threats are difficult to identify [2]. Commonly used protection methods include code-scanning verification, face recognition, and fingerprint recognition. To improve security, enterprises usually combine several protection methods for identity verification, but this greatly reduces both user experience and identification efficiency. In recent years, user and entity behavior analysis has become a research hotspot for unified identity management systems because it balances the efficiency and the security of identity recognition. User and entity behavior analysis is a method for discovering security threats in enterprise infrastructure [3]; it uses machine learning, algorithms, and statistical analysis to detect network attacks in real time [4]. Modeling user and entity behavior and identifying risk points with machine learning methods can effectively address the difficulty of detecting unknown threats and strengthen enterprise network security [5,6].

This contest targets the identity authentication problem of the unified identity management system. Participants need to build machine learning, artificial intelligence, or other models from users' historical system access logs and their risk labels, in order to further improve the security and identification efficiency of the unified identity management system. There are three main difficulties. First, the sample size is small and overfitting is likely, so extracting useful information from limited samples is hard. Second, the data set has few variables, and most of them are discrete. Third, the distributions of the offline data set and the online test set are inconsistent.

To this end, this paper proposes a system risk identification scheme based on user behavior pattern mining. First, user behavior patterns are mined in depth, the characteristics of risky access are extracted, and the regularities of user behavior are analyzed. Next, several types of features are constructed, including user basic login information, derived features based on authentication time, user login information of the day, user behavior regularity features, and cross features, and the mean variance test is used for feature selection. Finally, the data with a distribution similar to that of the online test set is selected for modeling, and LightGBM [7] with post-processing is chosen as the final model according to the experimental results. The team's solution scored 0.9480 on the preliminary leaderboard and 0.9486 on the final-round leaderboard, ranking third.

User Behavior Pattern Mining

Identifying risky user behavior patterns is the basis for effectively identifying system access risks. This section uses users' historical login data to mine behavior patterns in depth and uncover regularities that guide the subsequent feature engineering.

Figure 1 analyzes patterns in the user's basic login information. As shown in Figure 1(a), system access whose authentication city is "foreign" or "unknown" is especially dangerous: its risk rate is much higher than that of other cities. Figure 1(b) shows the relationship between the login URL and the dependent variable: logins to the two URLs "getLoginType" and "getVerifyCode" carry a high risk. Figure 1(c) shows that an abnormal HTTP status (any status code other than "200") is often accompanied by greater risk. Figure 1(d) shows that accessing the system at night (20:00~07:00) is often risky.

(a) Risk rate analysis of certified cities

(b) Risk rate analysis of login URL

(c) Risk rate analysis of HTTP status

(d) Risk rate analysis of night versus daytime access

Figure 1: Pattern analysis of a user's basic login information
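The risk rates plotted in Figure 1 are simply the share of risky records within each category. A minimal sketch of that computation follows; the field names (`op_city`, `http_status_code`, `is_risk`) and the toy records are assumptions for illustration, not the competition data.

```python
from collections import defaultdict


def risk_rate_by(records, key):
    """Return {category: share of risky records} for one categorical field."""
    totals = defaultdict(int)
    risky = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        risky[rec[key]] += rec["is_risk"]
    return {cat: risky[cat] / totals[cat] for cat in totals}


# Toy access log; field names and values are hypothetical.
logs = [
    {"op_city": "Qingdao", "http_status_code": 200, "is_risk": 0},
    {"op_city": "Qingdao", "http_status_code": 200, "is_risk": 0},
    {"op_city": "Qingdao", "http_status_code": 404, "is_risk": 1},
    {"op_city": "unknown", "http_status_code": 200, "is_risk": 1},
    {"op_city": "foreign", "http_status_code": 500, "is_risk": 1},
    {"op_city": "foreign", "http_status_code": 200, "is_risk": 0},
]

city_rates = risk_rate_by(logs, "op_city")
status_rates = risk_rate_by(logs, "http_status_code")
```

Categories whose rate stands out from the rest, as "foreign" and "unknown" do in Figure 1(a), are natural candidates for binary indicator features.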

Figure 2 uses a heat map to analyze the periodicity of the system access risk rate. The daily risk rate is computed and arranged in weekly (7-day) cycles. As shown in Figure 2, risk rates are generally high on Saturdays and Sundays. The abnormal risk rates from Monday to Friday of the fourth week occur because these five days fall in the Spring Festival holiday, and the high risk rates on Monday and Tuesday of the 13th week occur because these two days fall in the Qingming Festival holiday. The risk rates on the Saturday and Sunday of the third week and the Saturday of the 12th week are low because these three days are make-up workdays for the holidays, on which people work as usual. In summary, the effect of Saturdays, Sundays, and holidays on the system access risk rate must be taken into account.

Figure 2: Heat map cycle analysis of system access risk rate
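The grid behind a heat map like Figure 2 can be built by indexing each record by week number and weekday. The sketch below assumes each record is a `(date, is_risk)` pair and that week 0 starts on a chosen Monday; the dates are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def weekly_risk_grid(records, start):
    """Map (week index, weekday) -> risk rate; week 0 begins on `start`."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [risky count, total count]
    for day, is_risk in records:
        week = (day - start).days // 7
        cell = counts[(week, day.weekday())]  # Monday is 0, Sunday is 6
        cell[0] += is_risk
        cell[1] += 1
    return {key: risky / total for key, (risky, total) in counts.items()}


start = date(2022, 1, 3)  # a Monday, chosen only for this example
records = [
    (date(2022, 1, 3), 0),
    (date(2022, 1, 3), 0),
    (date(2022, 1, 8), 1),
    (date(2022, 1, 8), 0),
    (date(2022, 1, 9), 1),
    (date(2022, 1, 10), 0),
]
grid = weekly_risk_grid(records, start)
```

Reading the grid row by row makes weekday effects and holiday anomalies visible at a glance, which is what motivates the weekend and holiday flags used later.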

Figure 3 uses heat maps to analyze the periodicity of user login behavior. The number of logins per user is counted along two time dimensions: day of the week and hour of the day. As shown in Figure 3(a), user guojianping9672 logs in frequently between 8:00 and 9:00 on Wednesdays and between 13:00 and 16:00 on Thursdays, and almost never logs in on weekends. Figure 3(b) shows that user baojianhua2916 usually logs in to the system between 8:00 and 18:00 on weekdays, with a clear periodic pattern.

(a) Periodic analysis of user guojianping9672's login behavior heat map

(b) Periodic analysis of user baojianhua2916's login behavior heat map

Figure 3: Periodic analysis of heat map of user login behavior
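A per-user login heat map like those in Figure 3 amounts to counting logins per (weekday, hour) cell. A short sketch under that assumption, with invented timestamps:

```python
from collections import Counter
from datetime import datetime


def login_heatmap(timestamps):
    """Count one user's logins per (weekday, hour) cell; Monday is weekday 0."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)


# Hypothetical login times for a single user.
logins = [
    datetime(2022, 1, 5, 8, 15),   # Wednesday
    datetime(2022, 1, 5, 8, 40),   # Wednesday
    datetime(2022, 1, 12, 8, 5),   # Wednesday, one week later
    datetime(2022, 1, 6, 14, 30),  # Thursday
]
heat = login_heatmap(logins)
```

Cells that stay empty for a user, such as weekend hours here, describe that user's usual habits, so a login falling into such a cell is a candidate signal of unusual behavior.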

(a) Risk rate analysis of browsers

(b) Risk rate analysis after crossing the browser and authentication city features

Figure 4: Comparison of risk rate analysis before and after feature crossing

Figure 4 compares the risk rate before and after crossing the browser feature. As shown in Figure 4(a), the risk rates of accessing the system with different browsers are basically the same. However, after the browser is crossed with the authentication city, as shown in Figure 4(b), the new feature exhibits clear patterns.
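The effect described for Figure 4 can be reproduced on toy data: two fields that look uninformative on their own can separate risky from normal records once they are combined. The field names and records below are illustrative assumptions.

```python
from collections import defaultdict


def risk_rate(records, key_fn):
    """Risk rate per category, where key_fn builds the category of a record."""
    totals = defaultdict(int)
    risky = defaultdict(int)
    for rec in records:
        key = key_fn(rec)
        totals[key] += 1
        risky[key] += rec["is_risk"]
    return {key: risky[key] / totals[key] for key in totals}


# Hypothetical records in which risk depends on the browser/city combination.
logs = [
    {"browser": "chrome", "op_city": "Qingdao", "is_risk": 0},
    {"browser": "chrome", "op_city": "foreign", "is_risk": 1},
    {"browser": "edge", "op_city": "Qingdao", "is_risk": 1},
    {"browser": "edge", "op_city": "foreign", "is_risk": 0},
]

by_browser = risk_rate(logs, lambda r: r["browser"])
by_cross = risk_rate(logs, lambda r: f'{r["browser"]}+{r["op_city"]}')
```

In this toy case each browser alone has a risk rate of 0.5, while the crossed categories have rates of 0.0 or 1.0.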

Feature Engineering

Based on the in-depth mining and analysis of user behavior patterns, this paper extracts hundreds of user profile features from users' historical system access logs. The mean variance test (MV test) [8,9] is then used for feature selection, to identify features that are significantly associated with the dependent variable. The MV test is a simple and efficient independence test that makes no distributional assumptions and can test whether a continuous variable and a discrete variable are associated. In the end, 182 important features were selected for modeling. Figure 5 shows the top 10 features ranked by the MV test statistic. The 182 features fall into five categories: user basic login information, derived features based on authentication time, user login information of the day, user behavior regularity features, and cross features.

(1) User basic login information. These are binary features obtained by processing the user's basic information according to the rules discovered through user behavior pattern mining. For example, based on Figure 1(a), whether the login city is "foreign" or "unknown"; based on Figure 1(b), whether the login URL is "getLoginType" or "getVerifyCode"; based on Figure 1(c), whether the HTTP status code is "200"; and so on.

(2) Derived features based on authentication time. Because the authentication time of a system access plays an important role in identifying risk, this paper generates a series of binary features from it. For example, whether the authentication time is at night, on a weekend, on a holiday, during a rest period, or on a working day.

(3) User login information of the day. For each user, this paper extracts the number of logins, the number of IP changes, the number of logins with a given device, the number of logins with a given IP address, and the number of visits to a given website, counted from 00:00 of the current day up to the current moment. This information reflects the user's activity level and behavior pattern on that day.

(4) User behavior regularity features. These include time difference features and statistical features. For the time difference features, this paper extracts time differences relative to previous logins (forward window) and relative to the next login (backward window). The forward-window features include the time difference between a user's two most recent logins, the time since the user last logged in with a given device, the time since the user last logged in from a given IP, the time since the user last visited a given website, and statistics of these time differences. Examples are the mean, maximum, minimum, and standard deviation of the time since the user's last login with a given device during holidays, and the same statistics of the time since the user's last visit to a given website on weekends. For the backward-window features (the time difference to the next login), computing the value directly would leak future information, so only summary statistics of these time differences are computed on the training set and then merged into the test set. For the statistical features of user logins, this paper uses the training set to extract each user's average number of logins per hour on holidays and on non-holidays, the average number of logins per hour on each day of the week, and the average number of logins per hour with a given piece of login information on each day of the week. These features reflect, to some extent, the user's login habits at different points in time, which also helps predict whether a sample is risky.

(5) Cross features. These include simple cross features and statistical features. Simple cross features directly combine two given categorical features. As shown in Figure 4, the new feature "browser + authentication city", obtained by combining the browser and the authentication city, carries more information. Statistical cross features are statistics computed over a combination of features, for example the average number of times a given department visits a given website.
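The history-based features in categories (3) and (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's code: the field names `user_name`, `ip`, and `op_datetime` are hypothetical, records are assumed to be sorted by time, and the next-login statistics are computed on training rows only and then looked up for test rows, mirroring the leakage-avoiding step described in (4).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def same_day_counts(records):
    """Running per-user counts since 00:00 of the same day (time-sorted input)."""
    logins = defaultdict(int)      # (user, day) -> logins so far, current included
    ip_changes = defaultdict(int)  # (user, day) -> IP switches so far
    last_ip = {}                   # (user, day) -> most recent IP
    rows = []
    for rec in records:
        key = (rec["user_name"], rec["op_datetime"].date())
        if key in last_ip and last_ip[key] != rec["ip"]:
            ip_changes[key] += 1
        last_ip[key] = rec["ip"]
        logins[key] += 1
        rows.append({"logins_today": logins[key],
                     "ip_changes_today": ip_changes[key]})
    return rows


def previous_gaps(records):
    """Seconds since the same user's previous login; None for a first login."""
    last_seen = {}
    gaps = []
    for rec in records:
        user, ts = rec["user_name"], rec["op_datetime"]
        gaps.append((ts - last_seen[user]).total_seconds()
                    if user in last_seen else None)
        last_seen[user] = ts
    return gaps


def next_gap_stats(train_records):
    """Per-user statistics of the gap to the NEXT login, from training rows only."""
    times = defaultdict(list)
    for rec in train_records:
        times[rec["user_name"]].append(rec["op_datetime"])
    stats = {}
    for user, stamps in times.items():
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if gaps:  # users with a single login have no gap to summarize
            stats[user] = {"mean": mean(gaps), "max": max(gaps), "min": min(gaps)}
    return stats


# Hypothetical training and test records.
train = [
    {"user_name": "a", "ip": "ip1", "op_datetime": datetime(2022, 1, 5, 8, 0)},
    {"user_name": "a", "ip": "ip2", "op_datetime": datetime(2022, 1, 5, 8, 10)},
    {"user_name": "a", "ip": "ip1", "op_datetime": datetime(2022, 1, 5, 8, 40)},
    {"user_name": "b", "ip": "ip3", "op_datetime": datetime(2022, 1, 5, 9, 0)},
]
test = [
    {"user_name": "a", "ip": "ip1", "op_datetime": datetime(2022, 4, 1, 9, 0)},
    {"user_name": "c", "ip": "ip9", "op_datetime": datetime(2022, 4, 1, 9, 5)},
]

day_rows = same_day_counts(train)
gaps = previous_gaps(train)
stats = next_gap_stats(train)
# Test rows only look up statistics learned from training data.
test_stats = [stats.get(rec["user_name"]) for rec in test]
```

Users that never appear in the training data, like "c" here, get no statistics; how such missing values should be filled is a separate choice that the source does not describe.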

Figure 5: The top 10 important features ranked by the MV test statistic
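A ranking like the one in Figure 5 can be produced by scoring each feature with the mean variance index. The sketch below follows our reading of the sample index in [8], MV = (1/n) Σ_r Σ_i p_r (F_r(x_i) − F(x_i))², where F is the empirical distribution function of the feature and F_r the one within class r. It only ranks features; the critical values of the formal test are not reproduced, and the toy data is invented.

```python
from bisect import bisect_right


def mv_index(x, y):
    """Sample mean variance index between a continuous x and a discrete y."""
    n = len(x)
    all_sorted = sorted(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    total = 0.0
    for values in groups.values():
        values.sort()
        p_r = len(values) / n  # empirical class proportion
        for xi in x:
            f_all = bisect_right(all_sorted, xi) / n          # F(x_i)
            f_r = bisect_right(values, xi) / len(values)      # F_r(x_i)
            total += p_r * (f_r - f_all) ** 2
    return total / n


# Toy example: one feature separates the classes, the other does not.
y = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "separating": [1, 2, 3, 4, 5, 6, 7, 8],
    "uninformative": [1, 2, 3, 4, 1, 2, 3, 4],
}
scores = {name: mv_index(col, y) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A larger index suggests a stronger dependence between the feature and the label, so features are kept from the top of the ranking downward.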

Modeling and Postprocessing

The distributions of the offline data set provided in this contest and the online test set differ. As shown in Figure 6, the data from the first three months is distributed more similarly to the online test set than the full offline data set is. In addition, this paper applies adversarial validation: a model is trained with "is this sample from the test set" as the label and is then used to predict every sample in the training set. The results show that the first three months score highest in this test, which confirms that they are the most consistent with the distribution of the online test set. This paper therefore selects the data from the first three months, which is the most similar to the test set, as the training set.

Figure 6: Comparison of the distribution of the feature browser_diff1_max in the entire training set, the first three months of data, and the test set
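Adversarial validation, as described above, can be sketched with any binary classifier. The source does not say which model the team used for this step, so the tiny logistic regression below is only a stand-in, and the one-dimensional data is invented: "early" training rows resemble the test rows, "late" rows do not. Comparing the average predicted test-likeness per period is one plausible reading of how the periods were compared.

```python
import math


def predict(row, weights, bias):
    """Predicted probability that a row comes from the test set."""
    z = bias + sum(w * v for w, v in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, epochs=300, lr=0.1):
    """Tiny logistic regression trained by batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            error = predict(row, weights, bias) - label
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def mean_score(rows, weights, bias):
    """Average test-likeness of a group of training rows."""
    return sum(predict(r, weights, bias) for r in rows) / len(rows)


# Hypothetical single-feature data (for example a standardized time-gap feature).
early_train = [[0.9], [1.0], [1.1], [1.2]]
late_train = [[-1.1], [-1.0], [-0.9], [-0.8]]
test_rows = [[0.8], [1.0], [1.1], [1.3]]

rows = early_train + late_train + test_rows
labels = [0] * 8 + [1] * 4  # 1 marks rows that come from the test set

weights, bias = fit_logistic(rows, labels)
early_score = mean_score(early_train, weights, bias)
late_score = mean_score(late_train, weights, bias)
```

The period whose rows look most like the test set (here the early one) gets the higher average score and is the better candidate for the training set.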

Figure 1(d) shows that nighttime system access carries a high risk. Accordingly, the predictions are post-processed with this rule: the prediction of every nighttime sample is corrected to 1 (at risk). The experimental results show that this correction improves both the online and the offline scores. This paper also tried other post-processing rules. For example, correcting the prediction to 1 when there are more than 5 downloads within a short period improves the offline score but lowers the online score.
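The nighttime rule can be written as a small post-processing step. The sketch assumes that "night" means 20:00 up to, but not including, 07:00, following the interval given for Figure 1(d); the exact boundary handling is an assumption.

```python
from datetime import datetime


def is_night(ts):
    """True for timestamps from 20:00 up to (not including) 07:00."""
    return ts.hour >= 20 or ts.hour < 7


def apply_night_rule(predictions, timestamps):
    """Override the model output with 1.0 (at risk) for nighttime samples."""
    return [1.0 if is_night(ts) else p for p, ts in zip(predictions, timestamps)]


# Hypothetical model outputs and access times.
preds = [0.2, 0.7, 0.1, 0.3, 0.4]
times = [
    datetime(2022, 4, 1, 23, 0),
    datetime(2022, 4, 1, 10, 0),
    datetime(2022, 4, 2, 6, 59),
    datetime(2022, 4, 2, 7, 0),
    datetime(2022, 4, 2, 20, 0),
]
final = apply_night_rule(preds, times)
```

Daytime predictions pass through unchanged, so the rule only affects the samples it was designed for.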

Experimental results and analysis

Table 1 Comparison of experimental results

To evaluate offline performance, this paper runs 50-fold cross-validation on the data from the first three months and compares several models, including LightGBM, CatBoost, XGBoost, DeepFM, and DCN. As shown in Table 1, LightGBM gives the best online and offline results, and post-processing slightly improves the predictions of all models. LightGBM with post-processing is therefore chosen as the final solution.

Summary and Outlook

To meet the requirements of this contest, this paper proposes a system risk identification scheme based on user behavior pattern mining. First, building on traditional rule analysis, the regularities of user behavior are mined in depth and 182 features closely related to system access risk are extracted. They fall into five categories: user basic login information, derived features based on authentication time, user login information of the day, user behavior regularity features, and cross features. Next, an analysis of the data distribution shows that the first three months are more similar to the test set, so only the first three months are used as training data, which addresses the inconsistency between the distributions of the training set and the test set. Finally, the results in Table 1 show that LightGBM with post-processing performs best and scores stably, so this model is selected for training and achieves good prediction results.

Although the scheme in this paper achieves good results, some directions deserve further study. For example, time series such as the user login sequence could be extracted and modeled with convolutional or recurrent neural networks, which is expected to capture user behavior characteristics better.

Acknowledgments

Thanks to the organizers of the 2022 CCF Big Data and Computational Intelligence Competition. Thanks to the DataFountain platform and Zhuyun Technology Co., Ltd. for their help and explanations.

References

[1] Zhang Siyu, Huang Baoqing, Jiang Kaida. Unified identity authentication log centralized management and account risk detection [J]. Journal of Southeast University (Natural Science Edition), 2017, 47(S1): 113-117.

[2] Cui Jingyang, Chen Zhenguo, Tian Liqin, Zhang Guanghua. A Survey of User and Entity Behavior Analysis Technology Based on Machine Learning [J]. Computer Engineering, 2022,48(02): 10-24. DOI: 10.19678/j.issn.1000-3428.0062623.

[3] Lukashin, A., Popov, M., Bolshakov, A., Nikolashin, Y. (2020). Scalable Data Processing Approach and Anomaly Detection Method for User and Entity Behavior Analytics Platform. In: Kotenko, I., Badica, C., Desnitsky, V., El Baz, D., Ivanovic, M. (eds) Intelligent Distributed Computing XIII. IDC 2019. Studies in Computational Intelligence, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-32258-8_40

[4] M. A. Salitin and A. H. Zolait, "The role of User Entity Behavior Analytics to detect network attacks in real time," 2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), 2018, pp. 1-5, doi: 10.1109/3ICT.2018.8855782.

[5] Mo Fan, He Shuai, Sun Jia, Fan Yuan, Liu Bo. Application of User Entity Behavior Analysis Technology Based on Machine Learning in Account Anomaly Detection [J]. Communication Technology, 2020,53(05): 1262-1267.

[6] M. Shashanka, M. -Y. Shen and J. Wang, "User and entity behavior analytics for enterprise security," 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 1867-1874, doi: 10.1109/BigData.2016.7840805.

[7] Ke G, Meng Q, Finley T, et al. LightGBM: A highly efficient gradient boosting decision tree[J]. Advances in neural information processing systems, 2017, 30.

[8] Hengjian Cui, Runze Li, Wei Zhong. Model-Free Feature Screening For Ultrahigh Dimensional Discriminant Analysis. Journal of the American Statistical Association. 2015, 110(510): 630-641.

[9] Hengjian Cui, Wei Zhong. A distribution-free test of independence based on mean variance index. Computational Statistics & Data Analysis. 2019, 139: 117-133.
