A new risk control modeling competition has begun: Financial Big Data Application, Enterprise Credit Risk Prevention and Control, with a China Construction Bank dataset

Dear friends, the latest financial risk control modeling competition has begun! The competition is called Financial Big Data Application: Enterprise Credit Risk Prevention and Control. It is organized by the Digital China Construction Summit Organizing Committee, and China Construction Bank provides the competition dataset.

The prize money for this competition is very high: the total prize pool is 1.6 million yuan, and the first prize is 80,000 yuan.

Background of the competition:
  The digital transformation of financial institutions is in full swing. As one of its most important tools, artificial intelligence is being integrated into the business fields and application scenarios of the financial industry. AI technology has now penetrated the industry's five major business chains: financial product design, marketing, risk control, customer service, and other support activities. Empowered by technologies such as biometric identification, machine learning, computer vision, and knowledge graphs, the financial industry has developed many typical AI scenarios, such as intelligent marketing, intelligent identification, and intelligent customer service.

Competition task
  1. Combine financial data with government data; you may also bring your own industry data to enrich the model dimensions. Develop a creative design covering demand analysis, scenario design, solution, implementation verification, and product value, and submit a creative solution.
  2. Corporate credit risk prevention and control plan. Combine enterprise data with public data to build an enterprise credit risk analysis model. The scenario can assess every aspect of an enterprise's risk, for example access management, early warning monitoring, credit limit adjustment, and post-loan management. Based on the model and the business scenario, design a complete risk prevention and control plan that improves the bank's credit risk prevention and control capabilities.

Entry rules

▶▶ Participants: The competition is open to everyone, regardless of age or nationality. People from colleges and universities, scientific research institutes, and enterprises can all log on to the official website to register. Employees of the units involved in organizing the competition may participate but cannot win awards.
▶▶ Registration requirements: For each competition problem, a person can join only one team (1-5 people). You may enter several competition problems at the same time and may have a different team for each. When registering, all members must provide basic personal information and pass real-name authentication. Team formation must be completed before the team formation deadline, and once a team is formed, members cannot leave it. Team formation conditions: the total number of submissions by each member must be ≤ the number of days since the competition started × 3, and a team must have at least one Chinese player. For more competition rules, please visit the homepage of the official website.

Data description

Mr. Toby has also downloaded the competition data and observed that the variables in this dataset are open and transparent. This makes the competition very meaningful: through data mining and modeling, we can identify valuable variables and their business meanings.

The figure below shows the dataset provided by China Construction Bank: 47 variables and 120,000 customer records in total, a relatively large volume of data.

  The corporate credit risk prevention and control data mainly include corporate business information, basic information about the legal person, provident fund payments, and other financial and government affairs data (both are simulated data), all described in the data dictionary. Players can split the data into a training set and a test set themselves according to the needs of their plan, and they can also bring their own data to enrich the plan's dimensions.

Submission requirements

  In the preliminary stage, participants are required to provide a solution design specification (in PPT, WORD or PDF format) and a result model. The content should include but not be limited to:
  Background analysis: has practical significance, addresses actual problems faced by the financial industry, and analyzes the current business situation, pain points, and difficulties in light of specific circumstances;
  Implementation plan: based on the background analysis, designs digital scenarios and proposes innovative ideas that can solve problems, reduce financial risks, and improve customer experience;
  Data analysis: analyzes the selection and use of data, including the data cleaning process, field screening, importance analysis, etc.;
  Algorithm analysis: analyzes and introduces the specific algorithms used to build the model, including the reasons for choosing them, the parameter tuning process, etc.;
  Value of the work: reflects the actual implementation value of the work, demonstrated through measurable indicators.

Submission example

  The solution design specification can be in PPT, WORD or PDF format, with the file named "competition title + team name + scheme name".
  If there are audio-visual, data, model, or other files, please put them in the same folder, compress it, and submit the archive.

Evaluation criteria

  The organizer of the competition will set up an expert jury responsible for evaluating the entries.
  The expert jury will score each entry on a 100-point scale, according to the weight of each indicator and its reference description. The evaluation criteria below are tentative and for reference only; they may be adjusted as the competition is organized, and the actual evaluation standards shall prevail.

  Scoring dimension        Weight
  Work maturity            40%
  Technical level          30%
  Application potential    20%
  Defense performance      10%

  ● Work maturity (40%)
  (1) Demand analysis (10%): addresses a problem with strong social significance and a real need in the financial industry; based on data processing and analysis and on real conditions, accurately identifies the pain points, difficulties, and bottlenecks of the demand;
  (2) Scenario design (10%): designs digital scenarios based on the demand analysis and proposes innovative ideas that can solve problems, reduce social costs, and improve efficiency;
  (3) Solution (10%): building on the scenario design for the competition problem, puts forward a practical solution that meets financial needs and forms a relatively complete analysis report or comprehensive plan;
  (4) Data usage (10%): provides a clear list of the data required for system construction, which may include data category, data format, data function, data source, and other information, and clearly plans the data usage process.
  ● Technical level (30%)
  (1) Advancement (5%): makes effective use of cloud computing, big data, artificial intelligence, and other technologies, with technical capabilities that lead the market and existing applications;
  (2) Innovation (20%): the ideas and solutions are highly innovative and clearly differ from, and improve on, traditional methods;
  (3) Maturity (5%): the solution goes deep into the needs of the industry, can effectively solve industry pain points, anticipates the risks that may arise during future implementation, and proposes corresponding contingency plans.
  ● Application potential (20%)
  (1) Practicality (5%): the work plan meets the needs of actual usage scenarios, can be put into practice, and solves real business problems;
  (2) Universality (5%): the work plan has a strong ...;
  (3) Social benefits (5%): once applied in practice, the work plan can generate substantial social benefits and effectively help benefit the people, promote business, and optimize governance;
  (4) Commercial value (5%): the work plan can be applied efficiently and at low cost, and has strong commercial value and promotion potential.
  ● Defense performance (10%)
  During the defense, the presenter is poised, expresses ideas clearly and logically, and answers the experts' questions reasonably, showing relatively rich experience and professional ability.
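
To make the weighting concrete, here is a minimal sketch of how a 100-point total could be combined from the four dimensions. The weights are the ones listed in the evaluation criteria. Everything else is an assumption for illustration: the function name `total_score`, the idea that each dimension is first scored from 0 to 100 and then weighted, and the sample scores. The jury's actual scoring procedure is not published in this article.

```python
# Weights of the four scoring dimensions, as listed in the evaluation criteria.
WEIGHTS = {
    "work_maturity": 0.40,
    "technical_level": 0.30,
    "application_potential": 0.20,
    "defense_performance": 0.10,
}


def total_score(dimension_scores):
    """Combine per-dimension scores (assumed to be 0-100 each) into a 0-100 total."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Hypothetical scores for one team, for illustration only.
example = {
    "work_maturity": 85,
    "technical_level": 70,
    "application_potential": 90,
    "defense_performance": 80,
}
print(round(total_score(example), 2))
```

Because work maturity carries 40% of the total, a weak analysis report costs more than a weak defense, which is worth keeping in mind when dividing the team's time.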

Mr. Toby notes that this competition is very open. Instead of ranking teams by a single metric (accuracy, AUC, or F1 score), it assesses participants from multiple angles. In my previous article "Sichuan College Student Financial Technology Modeling Contest - Model Reproduction and Comments", I offered suggestions for the organizer to improve, as shown in the figure below.

It seems the organizer has read that article: this competition does make up for the earlier shortcomings and can be called a classic. I recommend that everyone take part to improve their modeling skills.

Unboxing test

After downloading the data, Mr. Toby ran an unboxing test. As a first probe, he plotted histograms of the variables and a correlation heat map.

Mr. Toby suggests that nobody rush into modeling. Observe the distribution of the data first, much like gathering intelligence before an attack.

Through descriptive statistics, Mr. Toby found that this dataset requires a lot of preprocessing, which is difficult for beginners. For example, there are many time variables, from which derived variables can be built.
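
As an illustration of deriving variables from time fields, here is a minimal sketch that uses only the standard library. The field names (`loan_apply_date`, `legal_person_birth_date`, `company_founded_date`) and the chosen features are hypothetical. The real column names are in the competition's data dictionary, and this is not necessarily the set of derived variables Mr. Toby used.

```python
from datetime import date


def years_between(earlier, later):
    """Whole years elapsed from `earlier` to `later`."""
    years = later.year - earlier.year
    # Subtract one if the anniversary has not yet been reached in `later`'s year.
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years


def derive_time_features(record):
    """Build derived variables from the date fields of one application record."""
    apply_date = record["loan_apply_date"]
    return {
        "legal_person_age_at_apply": years_between(
            record["legal_person_birth_date"], apply_date
        ),
        "company_age_at_apply": years_between(
            record["company_founded_date"], apply_date
        ),
        "apply_month": apply_date.month,
        "apply_weekday": apply_date.weekday(),  # Monday is 0
    }


# Hypothetical record, for illustration only.
record = {
    "loan_apply_date": date(2022, 6, 15),
    "legal_person_birth_date": date(1980, 9, 1),
    "company_founded_date": date(2015, 3, 20),
}
features = derive_time_features(record)
print(features)
```

Durations measured at application time (age of the legal person, age of the company) tend to carry more business meaning than raw dates, and they only use information that was available when the loan was requested.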

This dataset has wrong data mixed in. Without automated detection tools, contestants will find it hard to spot the traps buried inside. For example, the loan application time contains values dated in the year 2999, and the legal person's date of birth contains values dated in the year 3019. What is this stuff?
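
A simple automated check for this kind of trap is to flag every date that falls outside a plausible range. The sketch below uses only the standard library. The field name `loan_apply_date`, the helper `find_implausible_dates`, and the bounds 2000-01-01 to 2023-12-31 are assumptions for illustration and should be set from the business context.

```python
from datetime import date


def find_implausible_dates(records, field, earliest, latest):
    """Return (row_index, value) pairs whose date is missing or outside [earliest, latest]."""
    flagged = []
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None or not (earliest <= value <= latest):
            flagged.append((index, value))
    return flagged


# Hypothetical records, for illustration only.
records = [
    {"loan_apply_date": date(2021, 5, 3)},
    {"loan_apply_date": date(2999, 1, 1)},  # impossible future date
    {"loan_apply_date": date(2020, 11, 20)},
    {"loan_apply_date": None},  # missing value
]
suspicious = find_implausible_dates(
    records, "loan_apply_date", date(2000, 1, 1), date(2023, 12, 31)
)
print(suspicious)
```

Returning the row indices rather than silently dropping rows leaves the decision (delete, set to missing, or correct) to the modeler. With pandas, the same check is usually written as a boolean mask over a datetime column.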

Dirty data like this suggests that whoever maintains China Construction Bank's complex database was careless, or that a few wrong records were entered deliberately. Dirty data is normal: with data volumes this large, we run into it often.

In Mr. Toby's first modeling run, the model's performance came out perfect. An inexperienced player would have fainted with joy. To us, a model that good is suspicious, and we have to check the business meaning of the variables carefully.

As expected, some variables carry a data leakage risk. For what data leakage is, please read my previous article "Data Leakage - How to Demystify Machine Learning Models to Cheating".

After multiple rounds of variable screening, Mr. Toby deleted the variables suspected of data leakage and finally built the model with 34 variables. The model still performs very well. The modeling showed that the dataset contains several strong variables. With variables like these, China Construction Bank's risk control capabilities should be very good.
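
One common heuristic for spotting leakage suspects, which is not necessarily the method Mr. Toby used, is to compute the AUC of each variable on its own against the target. A single variable that separates good and bad customers almost perfectly deserves a business review. The sketch below uses only the standard library. The function names, the 0.95 threshold, and the column names are assumptions for illustration.

```python
def single_variable_auc(values, labels):
    """AUC of one variable used alone as a score (ties count as half a win)."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def flag_leakage_suspects(columns, labels, threshold=0.95):
    """Return {name: strength} for variables whose solo AUC is suspiciously extreme."""
    suspects = {}
    for name, values in columns.items():
        auc = single_variable_auc(values, labels)
        strength = max(auc, 1.0 - auc)  # direction of the relationship does not matter
        if strength >= threshold:
            suspects[name] = strength
    return suspects


# Hypothetical data: 1 = defaulted, 0 = did not default.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "overdue_days_after_loan": [0, 0, 0, 35, 60, 90],  # only known after the outcome
    "registered_capital": [50, 200, 80, 120, 30, 150],
}
suspects = flag_leakage_suspects(columns, labels)
print(suspects)
```

The pairwise loop is quadratic and only suitable for small samples. On the full 120,000 rows, a rank-based AUC such as scikit-learn's `roc_auc_score` is the practical choice. A flagged variable is only a suspect: the final call depends on whether its value would really be known at the time of the credit decision.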

Among the 34 variables Mr. Toby used, very few are highly correlated with one another. With stricter criteria, these 34 variables could be screened further: 10 to 20 variables are enough for this model to achieve excellent performance.
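
Screening out highly correlated variables can be sketched as a greedy filter: walk through the variables in order and keep one only if its absolute Pearson correlation with every variable already kept stays at or below a threshold. The sketch below uses only the standard library. The 0.9 threshold, the function names, and the column names are assumptions for illustration, not the settings used in the article.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(columns, threshold=0.9):
    """Keep variables in order, skipping any whose |r| with a kept variable exceeds threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns, for illustration only.
columns = {
    "annual_revenue": [10, 20, 30, 40, 50],
    "annual_revenue_in_thousands": [10000, 20000, 30000, 40000, 50000],  # same information
    "employee_count": [5, 3, 9, 2, 7],
}
kept = drop_highly_correlated(columns)
print(kept)
```

The result depends on the order of the variables, so it helps to sort them first by importance or business relevance. That way the variable that survives from each correlated group is the one you would rather keep.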

As shown in the figure below, the AUC of the model is 0.98. Of course, I can push it higher. This result comes from the first few rounds of testing, and performance-improving methods such as model tuning have not been applied yet.

Some variables rank low in importance but matter a great deal in business terms. I still recommend keeping them and continuing to collect data to see how they behave. In modeling, we cannot look only at the statistical results; we must also respect the business meaning. Data modeling and business meaning are like the yin and yang of Tai Chi: neither can be left out, and the best results come from balancing the two.

Summary

The Financial Big Data Application: Enterprise Credit Risk Prevention and Control competition is a very good one, and I encourage everyone to participate. If you want to learn risk control modeling methods and code, you can follow Mr. Toby's self-developed course "Python Financial Risk Control Scorecard Model and Data Analysis". The tutorials include introductions to, and code for, common algorithms such as logistic regression, ensemble trees, and neural networks, along with a large number of practical cases with strong model performance. The material is suitable for papers, assignments, patents, modeling competitions, and enterprise models. Everyone is welcome to bookmark it for work and study.

If any friends need custom work for a modeling competition, you can leave a message for the blogger.

Copyright statement: This article comes from the official account (python risk control model). Do not plagiarize it without permission. It follows the CC 4.0 BY-SA copyright agreement; when reprinting, please attach the original source link and this statement.

Origin blog.csdn.net/toby001111/article/details/129496108