Practical insights | Building a data operation system for China Unicom's government and enterprise business

The following content is compiled from the final defense report of the students in the compulsory course "Big Data System Foundation" of the Big Data Capacity Improvement Project.

We will introduce our project in the following parts: requirements analysis, data extraction and processing, sample definition and distribution, coarse feature screening and model selection, fine feature screening and scorecard modeling, and finally scorecard construction with TOAD and decision-making suggestions.

First, requirements analysis. China Unicom faces several pain points in the customer rating scenario. Arrears among government and enterprise customers are common: the proportion of overdue customers is high, and so is the share of accounts receivable. China Unicom also has no early warning for overdue customers and no comprehensive evaluation covering overdue risk, revenue scale, and the customer's own business risk. Without customer ratings, it cannot allocate customer service staff and resources on a sound basis. In addition, its internal data is fairly disorganized and lacks well-defined indicators. The following requirements were therefore put to us. First, clean and integrate the data, with the key task being to screen out the indicators we need. Second, build an effective customer rating model. We borrowed the C-card model from financial risk control to build our scorecard system, because we need to predict from the existing overdue records whether customers will pay on time. Third, improve the accuracy of the model: because these are specific enterprise customers, we need to disturb them as little as possible, so we model both the likelihood of a customer becoming overdue and the severity of the overdue. Finally, use the DWF platform to build a visual customer rating system to promote business use, and give the business side suggestions based on quadrant analysis.

First, data extraction and organization. There are two parts of data. The first is business (industrial and commercial) data: we used one key field, entid, to integrate 54 tables of business information. The second is a wide table that we compiled from past arrears data. The more important step is choosing the target variables, of which there are two. The first is overdue likelihood. After talking with the business side, we concluded that a single overdue month can have many causes and should not by itself mark a customer as overdue, so we define customers who are overdue for two or more consecutive months as overdue customers. The second is overdue severity. To account for seasonal fluctuations in a company's consumption, we calculate the proportion of the amount overdue in a single month to the amount overdue in a year as the overdue severity. Based on these two target definitions agreed with the business side, we worked out the specific calculation logic and used SQL for the final data extraction. Finally, we joined the industrial and commercial data with the past arrears data.
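
As a minimal sketch of the first target definition, the following Python function labels a customer from a list of monthly overdue flags. The project itself used SQL for extraction, so this is only an illustration; the function names and the `min_run=2` default (reading "two consecutive months" as two or more) are assumptions.

```python
def longest_overdue_run(monthly_overdue):
    """Length of the longest streak of consecutive overdue months."""
    longest = current = 0
    for is_overdue in monthly_overdue:
        # Extend the streak on an overdue month, reset it otherwise.
        current = current + 1 if is_overdue else 0
        longest = max(longest, current)
    return longest


def m2plus_label(monthly_overdue, min_run=2):
    """Return 1 (positive sample) if the streak reaches min_run months, else 0.

    min_run=2 is an assumed reading of the "two consecutive months" rule.
    """
    return int(longest_overdue_run(monthly_overdue) >= min_run)
```

Under this reading, a customer with isolated overdue months such as `[0, 1, 0, 1, 0]` stays a good sample, while `[0, 1, 1, 0]` becomes a positive sample.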

Next, we explain the two target variables. The overall setting of our project is decision intelligence. In decision intelligence, the most important thing is not the model that comes later but the target definition that comes first; once the target is well defined, running the rest of the pipeline is fairly straightforward. The first target is the likelihood of going bad, or the bad trend my teammate just mentioned; the term for it is M2plus. A single bad month does not count as bad. Based on their domain knowledge, the business side holds that there is a bad trend only when two consecutive months are bad. On this basis we define a binary label: a customer who is overdue for two or more consecutive months during the performance period is a positive sample. Most customers are still good samples, although the share of positive samples is somewhat higher than is typical in the financial field. Given the time available, our instructor sampled about one-thousandth of the data in the system for us, which is a size we can run locally.

Besides the bad trend, we also need to look at how bad a customer is. In the indicator just described, the denominator is the annualized payment amount, which accounts for seasonal consumption fluctuations. So a customer who consumes a lot and is overdue at the same time may be bad. But this raises another problem. Since the goal is a decision, we may not want a continuous variable. The raw value is continuous, and we need to turn it into a binary label to support decisions. The question is where to cut a continuous ratio. We plotted the relationship between the ratio and the M2plus label defined earlier. Using the axis step method, we judged roughly that the cut point for an enterprise's overdue ratio lies somewhere between one quarter and one third. We chose one quarter: if an enterprise customer has an overdue ratio greater than or equal to 25% in an accounting period within the year, we label it a positive sample, and all others are negative samples. The ratio of positive to negative samples is about one to five, slightly higher than for the previous target.
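
The binarization step can be sketched as follows. This assumes the ratio is a single month's overdue amount divided by the annualized payment amount, as described above; the function and parameter names are hypothetical, and treating a non-positive denominator as a ratio of zero is our own choice for the sketch.

```python
def overdue_ratio(overdue_amount, annual_amount):
    """Overdue amount for one accounting period over the annual amount."""
    if annual_amount <= 0:
        # Assumption: with no annual amount there is nothing to compare against.
        return 0.0
    return overdue_amount / annual_amount


def severity_label(overdue_amount, annual_amount, threshold=0.25):
    """Return 1 (positive sample) if the ratio is at least the threshold.

    threshold=0.25 is the one-quarter cut point chosen in the text.
    """
    return int(overdue_ratio(overdue_amount, annual_amount) >= threshold)
```

Keeping the threshold as a parameter makes it easy to compare the one-quarter and one-third cut points that the plot suggested.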

With the samples defined, in intelligent decision-making we care about three things: the accuracy or generalization ability of the model, its stability, and its interpretability. Unlike the deep learning work the other students just presented, we are not chasing raw prediction accuracy. So the key metric in our ten-fold cross-validation is actually the KS value, which in finance and similar risk-decision fields measures how well a model separates good from bad. AUC is therefore not our only criterion. We compared three models. LR is a linear model with strong interpretability; the other two are black-box models. The question we wanted to answer is whether an interpretable linear model can match a black-box model while offering stronger explanatory power. Ten-fold cross-validation showed that this does hold for both targets. The premise, of course, is that we transformed the inputs to LR: without the WOE transformation, it performs very poorly.
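
For readers unfamiliar with the KS value, here is a minimal pure-Python sketch of one common way to compute it: the largest gap between the cumulative share of bad samples and the cumulative share of good samples when customers are sorted by predicted risk. The function name is hypothetical, and this is not necessarily the exact routine the project used.

```python
def ks_statistic(labels, scores):
    """Max gap between cumulative bad and good shares, sorted by score.

    labels: 1 for a positive (overdue) sample, 0 for a good sample.
    scores: predicted risk, higher meaning more likely overdue.
    """
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need at least one sample of each class")

    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume all samples tied at this score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            i += 1
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
    return ks
```

A model that ranks every overdue customer above every good one reaches a KS of 1, while a model that gives everyone the same score has a KS of 0.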

After selecting the LR model, we went on to compare feature importance across the models. Black-box models are well suited to coarse feature screening. If both models consider a feature important, we definitely carry it into scorecard construction. If random forest considers a feature unimportant but LR considers it important, we set it aside and decide later whether to include it.
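
The screening rule in this paragraph amounts to simple set logic. The sketch below assumes each model has already produced a list of features it considers important; the function name and the feature names in the usage note are made up for illustration.

```python
def split_features(lr_important, rf_important):
    """Split features into those to keep and those to review later.

    keep:   important to both LR and random forest.
    review: important to LR only, to be reconsidered during fine screening.
    """
    lr_set, rf_set = set(lr_important), set(rf_important)
    keep = sorted(lr_set & rf_set)
    review = sorted(lr_set - rf_set)
    return keep, review
```

For example, `split_features(["arrears_12m", "reg_capital"], ["arrears_12m", "industry"])` keeps `arrears_12m` and flags `reg_capital` for review.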

Finally, we move on to building the scorecard model. The scorecard procedure is largely fixed. The first step is to coarsely screen variables by their information content. After removing variables that clearly will not enter the final scorecard, we filter mainly on the IV value, the information value. IV is the sum over bins of the difference between the shares of positive and negative samples in each bin, multiplied by the logarithm of the ratio of those shares, and it measures a variable's predictive power. In scorecard practice, a variable with an IV greater than 1 is generally suspected of information leakage; such variables are usually turned into separate rules, or added back later after business-driven adjustment or analysis. A variable with an IV below 0.02 generally has little predictive value. So variables with an IV between 0.02 and 1 are the ones kept in coarse screening.

The second step is to adjust the variable binning. Because the end product is a scorecard, which is built mainly on logistic regression, we want the odds to change progressively and monotonically from bin to bin. The lower right corner shows the ideal case, the number of overdue arrears in the past 12 months. The red line is the bad debt rate, and it is monotonic. It also has a sensible business explanation: the more arrears in the 12-month observation period, the higher the likelihood or rate of subsequent overdue arrears. A monotonic, roughly linear binning result like this gives better results after WOE encoding. For categorical variables like the one in the lower left corner, for example an integer variable that ranges from negative to positive values, we generally look at the IV value in the upper left corner together with its binning result and then try some manually adjusted binnings. For example, splitting out -1 as its own bin raises the IV while keeping a fairly ideal monotonic trend, so the scorecard model keeps the three-bin result. After binning all other candidate variables in the same way, we move on to the second round of variable screening, which is based on model algorithms.
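
To make the WOE and IV calculation concrete, here is a minimal sketch for one binned variable. The sign convention (WOE as the log of bad share over good share), the 0.5 floor for empty bins, and the inclusive boundaries in `coarse_keep` are all assumptions for illustration; the bin counts in the usage note are invented.

```python
import math


def woe_iv(bins):
    """Compute per-bin WOE and total IV.

    bins: list of (good_count, bad_count) tuples, one per bin.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        # Floor counts at 0.5 so an empty cell does not produce log(0).
        good_share = max(good, 0.5) / total_good
        bad_share = max(bad, 0.5) / total_bad
        woe = math.log(bad_share / good_share)
        woes.append(woe)
        iv += (bad_share - good_share) * woe
    return woes, iv


def is_monotonic(values):
    """True if values never decrease or never increase across bins."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def coarse_keep(iv, low=0.02, high=1.0):
    """Coarse-screening rule from the text: keep IV between 0.02 and 1."""
    return low <= iv <= high
```

With made-up counts `[(400, 10), (300, 30), (100, 60)]`, the WOE rises from bin to bin, so `is_monotonic` returns `True`; the IV comes out above 1, which under the rule above would flag the variable for a leakage check rather than keep it.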

In the scorecard, we mainly use stepwise regression, with forward and backward selection over subsets of variables, and use the AIC and BIC criteria to decide which variables finally enter. LASSO or ridge regression can also be considered as options. The ultimate goal is to keep between 8 and 15 variables in the scorecard. On the right is the ten-fold cross-validation of scorecard modeling on the two target variables, M2plus and the binarized rate. Logistic regression is the main model used in building the scorecards. Ten-fold cross-validation gives a good picture of overall performance and model stability. On the left, for example, you can see that the results across the ten folds are quite stable. The scorecard on the right, for the binarized overdue ratio, is slightly worse than the M2plus one. Throughout variable screening we also referred to the important features shared by random forest and LR. The final M2plus scorecard has 11 variables, 7 of which are among those shared features. The rate scorecard has 10 variables in its final model. We also use indicators such as PSI to measure the stability of the variables in the scorecard; after verification, the PSI of the variables is basically below 0.01, which is ideal. To select the optimal model we mainly use the KS bucket table. If the bad debt rate column shows large differences between groups and is strictly monotonically increasing, the model is generally considered close to ideal and can be taken as the optimal model.
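
As an illustration of the PSI check mentioned above, the following sketch compares the share of samples per bin between a reference set and a later set. The function name and the small `eps` floor are assumptions; the shares in the usage note are invented.

```python
import math


def psi(expected_shares, actual_shares, eps=1e-6):
    """Population stability index between two binned distributions.

    Each argument is a list of per-bin shares that sums to 1.
    """
    if len(expected_shares) != len(actual_shares):
        raise ValueError("both distributions need the same number of bins")
    total = 0.0
    for expected, actual in zip(expected_shares, actual_shares):
        # Floor the shares so an empty bin does not produce log(0).
        expected = max(expected, eps)
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total
```

Identical distributions give a PSI of exactly 0, while a shift from `[0.5, 0.5]` to `[0.6, 0.4]` gives roughly 0.04, well above the 0.01 level reported for the variables in the text.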

The last step is assigning scores in the scorecard. In general this depends on what the business's decision-making and management levels need in order to understand and interpret the scores, and it includes a subjective, human-chosen part: the base odds and the base score. Once those manual inputs are set, the TOAD package produces the score assigned to each bin of each variable. A positive value means that the higher the score, the less likely the customer is to be overdue; a customer who falls into a bin with a negative value is more likely to be overdue. The strength of a scorecard is that business staff can easily explain why different customers get different scores, understand the reasons behind those scores, and see how a customer's score could be improved to reduce the likelihood of overdue. The next step, which usually takes the most time and resources in real business, is reporting before and after the scorecard goes live, that is, stability reports and evaluation together with business experts or through long-term deployment and practice. Particular variables whose IV is too high are handled separately, for example through customer segmentation. Because of the constraints of our project, this part is not reflected in our work.
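
The score assignment can be illustrated with the standard scorecard scaling formulas. This is a generic sketch, not the TOAD package's API: the defaults `base_score=600`, `base_odds=20` (good to bad) and `pdo=50` (points to double the odds) are hypothetical, and WOE is assumed to be the log of bad share over good share with the logistic regression predicting overdue.

```python
import math


def scaling(base_score, base_odds, pdo):
    """Return (factor, offset) so that base_odds maps to base_score."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def bin_points(woe, coef, factor):
    """Points contributed by one bin of one variable.

    A bin with negative WOE (fewer overdue customers) earns positive points.
    """
    return -factor * coef * woe


def total_score(intercept, terms, base_score=600, base_odds=20, pdo=50):
    """Score for one customer.

    terms: list of (coef, woe) pairs, one per scorecard variable.
    """
    factor, offset = scaling(base_score, base_odds, pdo)
    log_odds_bad = intercept + sum(coef * woe for coef, woe in terms)
    # Higher predicted overdue odds lower the score.
    return offset - factor * log_odds_bad
```

With these assumed defaults, a customer whose good-to-bad odds are 20 to 1 scores 600, and doubling those odds to 40 to 1 raises the score by 50 points.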

The next step was to design the overall scorecard on the DWF platform. Our original intention was to locate the status of each indicator quickly by looking up a customer's id. We may also query by other conditions, for example companies that have been established for a long time or whose registered capital reaches a certain level, and see how they are doing. In our discussions with Mr. Huang Yun, it came up that there may be many companies with state-owned backgrounds, so we also added search by industry and similar fields. This is the first function we wanted to implement. The second is that, after querying a company or several types of companies, we can quickly present the scorecard results, including the total score and the score on each indicator, so that users can see interactively what is going on in each part of the scorecard.

In this process, the China Unicom project made us realize that it is not only about helping China Unicom assess its customers' overdue risk or the revenue they bring. More importantly, we believe the same approach can be used behind the scenes to monitor the operating status of small and medium-sized enterprises, and some large ones too, which would be a useful supplement to industrial and commercial monitoring. The second point is the debate between technical logic and business logic that ran through our discussions. In our usual research, we tend to establish scorecard indicators from theory first: existing research typically starts with a theoretical framework and then looks for data. In this project, my teammates and I started with the data, then built the scorecard, and then worked toward its theoretical and practical significance.

On this basis, the last function we wanted to implement is to give China Unicom and the government decision-making suggestions. The first indicator has two levels: first, whether the company is likely to become overdue; second, the degree of overdue, that is, roughly how much is overdue. The second indicator is whether the enterprise's revenue contribution to China Unicom is high or low. We then combine this with its overdue risk to establish a combined index. We understood that we were not just completing an assignment, so we tried to be more careful and responsible throughout project communication and task completion.

Editor: Wen Jing

Proofreading: Liu Guangdong

Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131650871