When Financial Risk Control Meets Artificial Intelligence: ZhongAn Finance's Real-Time Feature Platform in Practice

Guide: As enterprises digitally transform and upgrade, online business has become multi-scenario, multi-channel, and diversified. Mining the value of data is a race against time, and the business places ever higher demands on the timeliness and flexibility of data. With data sources that are huge, dispersed, and highly concurrent, real-time data processing capability has become a major factor in a company's competitiveness. Today I am sharing the practice of ZhongAn Finance's real-time feature platform.

The following introduction is divided into four parts:

  1. Introduction to ZhongAn Finance MLOps
  2. Real-time feature platform architecture design
  3. Real-time business feature calculation in detail
  4. Application of features in anti-fraud scenarios

Speaker: Guo Yubo, Senior Technical Expert at ZhongAn

Introduction to ZhongAn Finance MLOps

What is MLOps

(1) Definition

MLOps is a methodology and architecture that integrates machine learning, data engineering, and DevOps, so that machine learning models can be iterated efficiently and applied to production continuously and stably. It is therefore both a set of practical methods and a set of architectural solutions.

(2) Collaborating teams

(1) Data product team: defines business goals and measures business value.
(2) Data engineering team: collects business data, then cleans and transforms it.
(3) Data science team: builds ML solutions and develops the corresponding feature models.
(4) Data application team: applies the models and continuously monitors the features.

ZhongAn Finance MLOps Process

(1) Sample preparation: the product and business teams define the business scope, determine the modeling goals, and select the sample population to prepare the training set.
(2) Data processing: clean missing values, outliers, erroneous values, and data formats, and transform continuous variables, discrete variables, and time series.
(3) Feature development: once data processing is complete, features can be derived. Financial features are mainly derived through approval logic, behavior aggregation, exhaustive enumeration, normalization, binning, WoE, dimensionality reduction, One-Hot encoding, and so on; features are then screened by quality metrics such as KS, IV, and PSI, or by expected overdue rate (a WoE/IV sketch follows this list).
(4) Model development: select algorithms and fit models.
(5) Model training: test and validate the model on the test data set and tune its parameters.
(6) Model application: once developed, the model must be deployed online.
(7) Model monitoring: after launch, the model must be continuously monitored and iteratively optimized.
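To make the screening metrics in step (3) concrete, here is a minimal sketch of computing WoE and IV for a binned feature. It is illustrative only, not the platform's actual code; the column names and sample data are hypothetical:

```python
import numpy as np
import pandas as pd

def woe_iv(df: pd.DataFrame, bin_col: str, label_col: str, eps: float = 0.5):
    """Compute Weight of Evidence (WoE) per bin and the feature's total
    Information Value (IV). label_col: 1 = bad (e.g. overdue), 0 = good."""
    grouped = df.groupby(bin_col)[label_col].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Smooth with eps so empty bins do not divide by zero.
    pct_bad = (grouped["bad"] + eps) / (grouped["bad"].sum() + eps)
    pct_good = (grouped["good"] + eps) / (grouped["good"].sum() + eps)
    grouped["woe"] = np.log(pct_good / pct_bad)
    grouped["iv"] = (pct_good - pct_bad) * grouped["woe"]
    return grouped, grouped["iv"].sum()

# Hypothetical sample: age bins vs. an overdue label.
data = pd.DataFrame({
    "age_bin": ["18-25", "18-25", "26-35", "26-35", "36+", "36+"],
    "overdue": [1, 0, 0, 0, 1, 0],
})
table, iv = woe_iv(data, "age_bin", "overdue")
print(table[["woe", "iv"]], f"\ntotal IV = {iv:.4f}")
```

A feature whose total IV falls below a chosen threshold would be screened out before model development.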

Why does ZhongAn Finance need to build MLOps?

ZhongAn Finance, on the one hand, provides credit support to underserved consumers and, on the other, provides risk mitigation for banks and other financial institutions in support of inclusive finance. It offers credit guarantee insurance for unsecured, purely online consumer loan platforms as well as for other financial institutions.

As an insurance company, ZhongAn Insurance bears the claims, so we must identify risk comprehensively, measure it accurately, and monitor it strictly. We have built a big-data risk control system that uses risk control rules and models as strategies and system platforms as tools. By mining the associations between big data and personal credit, we extract a large number of user risk features and risk models to improve the predictive power of risk control.

As risk control strategies became more refined, model applications scaled up, and feature usage became more real-time, we needed faster and more real-time feature development and model application, so we began to build a systematized feature platform.

ZhongAn Finance MLOps System

(1) Big data platform: data engineers use the big data platform to collect business data, build an offline data system organized by subject domain, and synchronize the relevant data back to the online NoSQL storage engine for use by the real-time feature platform.
(2) Feature engineering: data scientists perform feature engineering on the big data platform, using the offline data warehouse to mine and select features.
(3) Machine learning platform: data scientists carry out one-stop model development and application on the machine learning platform.
(4) Real-time feature platform: developed features and models are registered on the real-time feature platform; once configured there, the platform's data services provide upstream businesses with feature queries and model applications.

Real-time feature platform architecture design

Application Scenarios of ZhongAn Finance Features

ZhongAn Finance's real-time feature platform serves the entire financial business process, including core front-end scenarios such as login, access, credit granting, drawdown, and withdrawal. Back-end scenarios are mostly batch feature calls, and collections scenarios also use features and models. The feature platform was originally built to serve the risk control system; as the business developed, models were gradually applied to user marketing scenarios and some resource-based user recommendation services. It is worth noting, as anyone familiar with risk control knows, that one risk control strategy contains multiple risk control rules and each rule queries multiple features, so a single business transaction may be amplified into hundreds of queries on the real-time feature platform.

ZhongAn Finance Feature Data Classification

(1) Transaction behavior data: business data such as credit granting, loan application, repayment, quota adjustment, and overdue data.
(2) Third-party credit data: obtained by interfacing with third-party credit reporting agencies.
(3) Device capture data: device-related information obtained with the user's authorization.
(4) User behavior data: obtained through user behavior tracking.

Real-time Feature Platform Core Capabilities

Facing numerous feature data sources, the real-time feature platform needs rich data access capabilities and real-time data processing capabilities. To handle the large volume of feature requirements, it also needs efficient feature processing and configuration capabilities, and real-time business calls require fast system response. We carried out technology selection and architecture design around these core capabilities: after several rounds of selection, we adopted Flink as the real-time computing engine and Alibaba Cloud TableStore as the high-performance storage engine, and built the system as services and a platform on a microservice architecture.

Business architecture of real-time feature platform

The figure shows the business architecture of our real-time feature platform: the bottom layer is the feature data sources, the middle layer is the platform's core functions, and the top layer is the business applications. The feature platform has four main data sources:

(1) Credit data gateway: provides user credit data from credit reporting institutions such as the People's Bank of China; credit data is queried through real-time interfaces.
(2) Third-party data platform: provides data from external data service providers; real-time access is completed by calling third-party data interface services.
(3) Real-time computing platform: accesses business system transaction data, user behavior data, and captured device data in real time, and writes the results back to the NoSQL online storage engine after real-time computation.
(4) Offline computing platform: offline data is computed by Alibaba Cloud MaxCompute and synchronized to NoSQL storage, backfilling historical data so that features can be computed over a user's full business time series. In addition, some non-real-time indicators are written back to the NoSQL storage engine after processing in the offline data warehouse.

The core functions of the real-time feature platform:

(1) Feature gateway: the entrance and exit for feature queries, providing authentication, rate limiting, and feature data orchestration.
(2) Feature configuration: to support rapid feature launches, the platform provides configuration capabilities for third-party data features, real-time business features, mutual-exclusion rule features, and model features.
(3) Feature calculation: implemented as microservice subsystems, mainly covering third-party feature calculation, real-time feature calculation, anti-fraud feature calculation, and model feature calculation.
(4) Feature management: the management console provides lifecycle management for feature variables, model metadata management, and task management for feature batches.
(5) Feature monitoring: provides full-link tracing of feature calls, real-time alerts for feature calculation failures, feature value outliers, and model PSI fluctuations, plus statistical reports on feature usage.

Real-time access solution for third-party data

(1) Query method: call the third-party credit agency's real-time interface to obtain the message data, then process it into feature results. To reduce cost, we also implement a caching mechanism that cuts the number of third-party calls in offline scenarios.
(2) Innovation: the third-party data access engine connects to third-party data interfaces through pure configuration, generates features automatically through the feature processing engine, provides these configuration capabilities through a visual interface, and finally exposes the third-party feature calculation service to upstream callers through an interface (a configuration sketch follows this list).
(3) Difficulties solved: the hard part of third-party data access is the variety and complexity of providers' encryption methods and signature mechanisms. The access engine has a built-in set of encryption and decryption functions and supports custom functions.
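Here is a minimal sketch of what such configuration-driven access could look like. The provider config, signing scheme, endpoint, and field names are all hypothetical assumptions for illustration; a real engine would ship many built-in signers and allow custom functions, as described above:

```python
import hashlib
import json
import urllib.request

# Registry of signature functions; providers with custom schemes register here.
SIGNERS = {}

def signer(name):
    def register(fn):
        SIGNERS[name] = fn
        return fn
    return register

@signer("md5_concat")  # a common simple scheme: md5(secret + sorted params)
def md5_concat(params: dict, secret: str) -> str:
    payload = secret + "".join(f"{k}{v}" for k, v in sorted(params.items()))
    return hashlib.md5(payload.encode()).hexdigest()

# Hypothetical provider configuration, normally edited in the visual console.
PROVIDER_CONF = {
    "url": "https://example-bureau.com/api/score",  # placeholder endpoint
    "signer": "md5_concat",
    "secret": "demo-secret",
    "result_path": ["data", "score"],               # where the value lives
}

def call_provider(conf: dict, params: dict):
    """Sign the request per config, call the provider, extract the feature."""
    params = dict(params, sign=SIGNERS[conf["signer"]](params, conf["secret"]))
    req = urllib.request.Request(conf["url"],
                                 data=json.dumps(params).encode(),
                                 headers={"Content-Type": "application/json"})
    body = json.loads(urllib.request.urlopen(req, timeout=5).read())
    for key in conf["result_path"]:  # walk the configured result path
        body = body[key]
    return body

# Usage (against a real endpoint):
# call_provider(PROVIDER_CONF, {"id_no": "...", "name": "..."})
```

Onboarding a new provider then means adding a config entry (and, if needed, one signer function) rather than writing integration code.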

Real-time access to business data

Real-time business data access consists of two parts. First, Flink monitors the business database's Binlog in real time and writes the changes into the real-time data warehouse; second, Spark backfills historical data. Combining offline and real-time data supports feature processing over the full time series. To support high-performance real-time feature queries, both real-time and offline data flow back into NoSQL storage engines, chosen per data type: business transaction data mainly uses TableStore, user behavior feature data mainly uses Redis, and user relationship graph data is stored in a graph database. Overall, the current data system follows the mature Lambda architecture. A sketch of the Binlog leg follows.
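As a sketch of the Binlog leg of this pipeline, the following shows how a PyFlink SQL job with the Flink CDC connector can mirror a business table downstream. The table names, schema, and connection parameters are illustrative assumptions, and the print sink merely stands in for the real warehouse/online-store sink:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: capture Binlog changes of a business table via the Flink CDC
# connector (requires the flink-sql-connector-mysql-cdc jar on the classpath).
t_env.execute_sql("""
CREATE TABLE loan_orders_src (
    user_id BIGINT,
    order_id STRING,
    amount DECIMAL(18, 2),
    status STRING,
    update_time TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'biz-db-host', 'port' = '3306',
    'username' = 'reader', 'password' = '***',
    'database-name' = 'loan', 'table-name' = 'orders'
)""")

# Sink: a print table standing in for the DWD layer / online store sink.
t_env.execute_sql("""
CREATE TABLE dwd_loan_orders (
    user_id BIGINT, order_id STRING, amount DECIMAL(18, 2),
    status STRING, update_time TIMESTAMP(3)
) WITH ('connector' = 'print')""")

# Continuously replicate inserts/updates/deletes downstream.
t_env.execute_sql(
    "INSERT INTO dwd_loan_orders SELECT * FROM loan_orders_src").wait()
```

The Spark backfill leg would write the same schema for historical dates, so queries over the full time series see one consistent layout.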

System architecture of real-time feature platform

The architecture of our real-time feature platform is shown in the figure above. Upstream businesses query features through the feature gateway, which verifies query permissions, applies rate limiting, and dispatches feature query tasks asynchronously. The gateway first routes to the appropriate feature data source based on feature metadata; after the raw data is fetched, it is routed to the appropriate feature computing service for processing. Third-party features are processed from the raw message data returned by the third-party data platform and the credit gateway, while business feature calculation processes the real-time features of internal business systems.

Credit features are computed from business data synchronized in real time, while the offline feature service serves features computed in the offline data warehouse on MaxCompute. Anti-fraud feature calculation processes features such as user login device information, user behavior data, and the user relationship graph.

Once the basic third-party and business features are computed, they can be served directly to upstream businesses. Model services also depend on these basic features. Model feature calculation builds on the machine learning platform, which provides integrated model training, testing, and release; the feature platform integrates these capabilities to make model features automated and configurable.

Real-time business feature calculation in detail

Choosing a real-time feature calculation solution

There are two main solutions for real-time feature calculation. The first synchronizes raw business data in real time and performs feature processing inside the real-time computing tasks — the traditional ETL mode. Its advantage is very efficient feature queries with good performance, but the real-time tasks are complex, consume a lot of real-time computing resources, and make feature derivation difficult. The second synchronizes raw data in real time and computes features at query time — the newer ELT mode — which makes feature derivation much more convenient. Because our business derives features frequently and we wanted to save real-time computing resources, we chose the second, ELT-style approach.

Real-time business feature data flow

The real-time feature data flow synchronizes data in real time through Kafka plus Flink, and uses Spark to collect full time-series data from the offline data warehouse. Real-time business data mainly uses TableStore as the storage engine; combined with the real-time feature calculation engine and the multi-subject query capability of ID-Mapping, features can be generated through configuration.

Besides the data collected in real time by Flink, some data must be obtained by calling business system interfaces. Such data can also be registered as metadata in the feature data engine and configured and used just like data stored in TableStore. We use Alibaba Cloud TableStore, a fairly stable high-speed query engine, to support real-time feature queries; however, cloud product costs also matter, so you should choose a solution appropriate to your own situation.

Core data design for real-time data

We have multiple product lines, each with its own user primary key, while financial business scenarios mainly query features by dimensions such as the user's ID number or mobile number. We therefore abstracted a set of ID-Mapping tables for user entity relationships, mapping dimensions such as ID number and mobile number to user primary keys. When querying features, we first look up the ID-Mapping table by the query parameters to obtain the user ID, and then query the user's business details by that ID. The main detail data includes user credit data, drawdown details, repayment details, quota details, and overdue details. A sketch of this two-step lookup follows.
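Below is a minimal sketch of the two-step lookup, with plain dictionaries standing in for the TableStore tables; the layouts and field names are illustrative assumptions, not the actual schema:

```python
# ID-Mapping: query dimension (e.g. mobile, ID card) -> user primary key.
ID_MAPPING = {
    ("mobile", "13800000000"): "user_001",
    ("id_card", "310101199001011234"): "user_001",
}

# Detail rows keyed by (user_id, event_time) so a per-user range scan works,
# mirroring a primary-key-prefix layout in the online store.
LOAN_DETAILS = {
    ("user_001", "2023-06-01T10:00:00"): {"type": "drawdown", "amount": 5000},
    ("user_001", "2023-06-15T09:30:00"): {"type": "repayment", "amount": 1000},
}

def query_user_details(dim: str, value: str):
    user_id = ID_MAPPING.get((dim, value))           # step 1: ID-Mapping lookup
    if user_id is None:
        return []
    return [row for (uid, _), row in sorted(LOAN_DETAILS.items())
            if uid == user_id]                       # step 2: range scan by user

print(query_user_details("mobile", "13800000000"))
```

In production, step 2 would be a TableStore range query over the user-ID primary-key prefix rather than an in-memory scan.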

One pitfall we hit is the scenario where primary and secondary tables are updated at the same time. We store the primary and secondary tables as a single piece of feature data, mainly using column families, so under high concurrency, simultaneous updates of the primary and secondary tables can cause inconsistency. We now implement data compensation through a windowed task.

(Figure: main business data diagram.)

Real-time feature calculation engine

Early feature processing was implemented by developers writing code. As feature requirements grew, to support rapid feature launches we built a set of configuration-based feature capabilities around calculation functions, using an expression language and Groovy, combined with ID-Mapping, to form a feature calculation engine. The calculation process has the following steps (a configuration sketch follows the list):

(1) Create a real-time Flink task to synchronize user relationship data into the ID-Mapping table, supporting multidimensional user data queries.
(2) Create a real-time Flink task to write user business data back to Alibaba Cloud TableStore, synchronizing detailed business data in real time.
(3) On the platform's real-time feature configuration page, register the user business table synchronized to TableStore in the previous step as logical data for the feature calculation engine.
(4) On the feature calculation configuration page, select the relevant feature metadata and fill in the feature's basic information and its processing function. After testing and release, the feature can be used online.
(5) At query time, the ID-Mapping table is first consulted with the query parameters to obtain the user ID, and the user's detailed business data is then fetched from TableStore by that ID. The feature calculation engine evaluates the configured calculation expressions over the queried data to produce the feature values. As noted in step (4), all features under a feature group are calculated together.
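Here is a minimal sketch of the configuration-driven calculation in steps (4) and (5), using Python expressions in place of the platform's expression language and Groovy functions; the feature group, field names, and expressions are hypothetical:

```python
# Feature group configuration: each feature is an expression over the user's
# detail rows; in production these would be expression-language/Groovy snippets.
FEATURE_GROUP = {
    "loan_cnt_90d": "len([r for r in rows if r['type'] == 'drawdown'"
                    " and r['days_ago'] <= 90])",
    "repay_amt_30d": "sum(r['amount'] for r in rows"
                     " if r['type'] == 'repayment' and r['days_ago'] <= 30)",
}

def compute_feature_group(rows):
    """Evaluate every expression in the group against the queried rows.
    One raw-data query serves all features in the group; a real engine
    would use a sandboxed expression language rather than eval."""
    env = {"rows": rows, "len": len, "sum": sum}
    return {name: eval(expr, {"__builtins__": {}}, env)
            for name, expr in FEATURE_GROUP.items()}

rows = [
    {"type": "drawdown", "amount": 5000, "days_ago": 12},
    {"type": "repayment", "amount": 1000, "days_ago": 5},
    {"type": "drawdown", "amount": 2000, "days_ago": 200},
]
print(compute_feature_group(rows))
# {'loan_cnt_90d': 1, 'repay_amt_30d': 1000}
```

Because every feature in the group shares the same queried rows, deriving a new feature is just adding one more expression to the configuration — the main attraction of the ELT approach described earlier.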

Application of features in anti-fraud scenarios

Classification of anti-fraud features

As financial fraud risk keeps growing, the anti-fraud situation becomes ever more severe, and the feature platform inevitably has to support anti-fraud feature queries. Our classification of anti-fraud features is as follows:

(1) User behavior features: derived mainly from tracked user behavior data, such as the number of app launches, page visit duration, click counts, and the number of entries in an input box.
(2) Location recognition features: computed from the user's real-time geographic location using the GeoHash algorithm (see the sketch after this list).
(3) Device association features: realized mainly through the user relationship graph; by finding the users associated with the same device, fraud such as bonus hunting ("wool party") can be quickly located.
(4) User graph relationship features: real-time device information is captured in key business scenarios such as login, registration, credit granting, and drawdown, and combined with the user's three identity elements (name, ID number, mobile number) and some of their contacts to build the relationship graph; risk is then identified by querying the user's neighbor relationships and whether associated users appear on black or gray lists.
(5) User community features: extracted from community-level rules such as community size and the behavior of users within the community.
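For item (2), here is a self-contained sketch of standard GeoHash encoding and an illustrative location-jump check; the precision and prefix threshold are assumptions, not production values:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Standard GeoHash: interleave longitude/latitude bisection bits and
    emit one base32 character per 5 bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even, out = 0, 0, True, []
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def location_jump(prev, curr, prefix_len: int = 4) -> bool:
    """Illustrative rule: differing GeoHash prefixes between two events
    indicate a large location change worth flagging."""
    return geohash_encode(*prev)[:prefix_len] != geohash_encode(*curr)[:prefix_len]

print(geohash_encode(31.2304, 121.4737))                 # Shanghai: 'wtw3...'
print(location_jump((31.23, 121.47), (39.90, 116.40)))   # Shanghai -> Beijing: True
```

Because nearby points share GeoHash prefixes, prefix comparison turns "did the user move implausibly far?" into a cheap string check.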

Real-time anti-fraud feature data stream

The data flow for anti-fraud feature calculation is similar to real-time feature calculation, except that instead of real-time business data, anti-fraud scenarios focus on tracked user behavior data, captured device data, and extracted user relationship data. User behavior data is reported to Kafka through the tracking platform (XFlow) and processed in real time with Flink. Unlike real-time business feature processing, however, anti-fraud features are computed directly in the real-time data warehouse and stored in engines such as Redis and graph databases, because anti-fraud feature queries demand high performance and anti-fraud scenarios are more sensitive to real-time data changes. As the figure above shows, anti-fraud features are provided to the feature gateway as a calculation service through an HTTP API.

Relationship graph architecture diagram

The overall design of the user relationship graph is as follows:

(1) First, the choice of graph data sources. To build a valuable user relationship graph, accurate data is needed for graph modeling. The data mainly comes from user data such as mobile number, ID number, device information, the user's three identity elements, and contacts.
(2) Second, the choice of graph storage engine. What matters is the engine's stability, data real-time performance, ease of integration, query performance, and ability to handle large data volumes; comprehensive technical research is required before putting it into production.
(3) Third, the graph engine's algorithm support. Beyond basic adjacent-edge queries, does it offer richer graph algorithms, such as the community detection used in anti-fraud scenarios? NebulaGraph, the graph database we use, has shared-nothing distributed storage and can support graphs at the trillion scale. Information about NebulaGraph is available on its official website, so I won't go into detail here.

Anti-fraud graph feature extraction

Through the model team's mining of the user relationship graph, starting from multi-dimensional statistics such as the age distribution and estimated consumption distribution of user communities, we extracted some graph features. Here is a list for reference:

(1) First-party fraud: the graph shows the same person applying multiple times, with inconsistent contacts and other associated information submitted each time, suggesting first-party fraud.
(2) Suspected loan intermediary: multiple applicants are linked to the same contact mobile number.
(3) Suspected identity misuse: one person's mobile number is associated with many people, suggesting the information may have been leaked.
(4) Suspected gang fraud: the number of nodes in a community of the relationship graph exceeds a certain scale.

Anti-fraud policy rules: one or two features alone may not pinpoint fraudulent behavior; multiple types of features must be combined into policy rules, improving the accuracy of fraud identification from multiple angles (a rule-combination sketch follows).
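A minimal sketch of such rule combination, with hypothetical feature names, thresholds, and hit logic:

```python
# Each rule is a predicate over the computed anti-fraud features.
RULES = {
    "suspected_intermediary": lambda f: f["applicants_per_contact_mobile"] >= 5,
    "suspected_info_reuse":   lambda f: f["users_per_mobile"] >= 3,
    "suspected_gang":         lambda f: f["community_size"] >= 20,
    "device_multi_account":   lambda f: f["accounts_per_device"] >= 4,
}

def evaluate_policy(features: dict, min_hits: int = 2):
    """A strategy fires only when several rules hit together, which reduces
    false positives compared with acting on any single feature."""
    hits = [name for name, rule in RULES.items() if rule(features)]
    return {"hits": hits, "reject": len(hits) >= min_hits}

sample = {"applicants_per_contact_mobile": 7, "users_per_mobile": 1,
          "community_size": 35, "accounts_per_device": 2}
print(evaluate_policy(sample))   # two rules hit -> reject
```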

Q&A

Q1: Flink's source data comes from Kafka. Can the platform perform join queries across multiple Kafka messages?

A1: For real-time business data collection, we use Flink to clean the detailed data and write the DWD-layer data back to TableStore; the real-time feature calculation engine then performs the data association, with multiple calculation factors configured, joining multiple pieces of data to produce the final features. We rarely perform such join queries inside Flink itself; this is the now-popular ELT mode.

Q2: For real-time feature calculation, your company chose solution 2. With many variables, how do you ensure interface response efficiency?

A2: Feature calculation is done at the feature-group level, and a feature group may contain dozens or even hundreds of features. The main performance cost of our computing framework is querying the features' raw data: once the raw data is fetched, the feature logic is computed in memory. This requires a high-performance underlying query engine; we rely on Alibaba Cloud TableStore for fast data queries.

Q3: For unsupervised anomaly detection algorithms such as Isolation Forest or LOF, how should app tracking (buried-point) behavior data be handled, and what kind of feature extraction works best?

A3: Model algorithms are not my specialty. My understanding is that you can start from feature quality and feature algorithm metrics, but there is no universal solution; only by validating and tuning the algorithm against your actual business data will you find the answer.

Q4: Is there an effective way to evaluate community detection algorithms on the relationship graph? And which community detection algorithm do you generally use?

A4: We currently use a connected-component algorithm. Our relationship graph is used not only for anti-fraud feature calculation but also by the anti-fraud team for investigations: they reverse-verify the graph against users suspected of fraud and assess the algorithm's effect through actual casework. We use Spark GraphX's connected components algorithm: for each vertex in a subgraph it computes the minimum connected vertex ID, and all nodes sharing the same minimum vertex ID are combined to generate a new ID that serves as the community ID.
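The sketch below shows the same idea in plain Python using union-find: each vertex is labeled with the smallest vertex ID in its connected component, which then serves as the community ID. It mirrors the GraphX approach described above but is only an illustration:

```python
def connected_components(edges):
    """Union-find over an edge list; each vertex is labeled with the smallest
    vertex ID in its component, mirroring GraphX connectedComponents output."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root to the smaller so roots stay minimal.
            parent[max(ra, rb)] = min(ra, rb)

    for a, b in edges:
        union(a, b)
    return {v: find(v) for v in parent}

# Hypothetical edges: (user, device/contact) relations flattened to vertex IDs.
edges = [(1, 2), (2, 3), (10, 11), (3, 4)]
print(connected_components(edges))
# {1: 1, 2: 1, 3: 1, 10: 10, 11: 10, 4: 1}
```

Community size — the basis of the "suspected gang fraud" feature above — is then just the count of vertices sharing a community ID.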

Q5: Your company relies heavily on graph performance for real-time anti-fraud variables. Is there currently a performance bottleneck?

A5: It mainly depends on the volume of relationship data in the graph community. For an ordinary user the whole neighborhood rarely exceeds ten nodes, but if the person really is an intermediary or a suspected fraudster, the subgraph can be very large and queries can indeed time out. In anti-fraud scenarios we therefore set a fallback plan: for example, if the anti-fraud graph feature interface takes more than 100 milliseconds to respond, we let the user pass by default, so as not to degrade the user's real-time experience.

Q6: What is the difference between anti-fraud features and indicators?

A6: Anti-fraud features focus more on users' behavioral characteristics and lean more toward mining user behavior.

Q7: What is the data volume? How long do the real-time feature tasks take, and how long do the offline computing tasks take? Could you introduce that?

A7: Our daily feature query volume is on the order of 80 to 90 million. The latency of real-time features depends on Flink's real-time capability, and data synchronization completes within tens of milliseconds. Feature queries rely on Alibaba Cloud TableStore and take about 100 milliseconds each, so performance is fairly well guaranteed.

Q8: After Flink finishes computing, can a real-time feature query still miss data?

A8: It can. Because we write in real time by monitoring the Binlog, the Binlog volume can be very large during business peaks, especially in batch-processing scenarios, so end-to-end real-time collection may take longer than usual. For example, if a user draws down immediately after credit granting, some credit data may not yet have been written to the online query engine, and that real-time query may miss data.

Q9: Do long-window features cause efficiency problems in time-series feature calculation, and how do you solve them?

A9: Our storage is organized by user dimension, with the user ID as the primary key, and we fetch all of a user's data through a TableStore range query. In financial scenarios, unlike e-commerce or content businesses, a single user does not have that much transaction and business data, so there are few efficiency problems.

Q10: How do you define feature dimensions? Are uncommon dimensions, such as coupon ID, ever used?

A10: Dimensions such as coupon ID may appear in marketing scenarios. In risk control scenarios, feature dimensions are basically person-based, such as the user's ID number and mobile number; features based on a coupon-ID dimension are relatively rare.

Q11: Is there feature extraction for deep networks?

A11: There is no exploration in this area yet.

Q12: In practice, what is Flink's maximum calculation time window — more than 48 hours?

A12: Yes, it can exceed that; our maximum window is 3 days. But for real-time feature scenarios the window is usually small, typically at the minute level. We support some feature scenarios with low real-time requirements within a 3-day window; that consumes more resources and is relatively rare.

Q13: How many real-time models are currently online? What problems have you encountered?

A13: Not many compared with the entire feature system, though our models cover the whole financial business domain, including marketing scenarios; the exact number is not convenient to disclose.

The main problem we encounter now is that feature development and online feature application are done by two different teams, so there are some inconsistencies between their implementations. Model development relies on offline feature mining, while in production real-time features are fed to the model as inputs; the data can differ, which affects the model's PSI stability.

Q14: How do you solve consistency between online and offline?

A14: That is a very big topic, and the popular stream-batch unified solutions are also trying to solve it. As I understand it, the Lambda architecture first needs some adjustment, using a data lake so that offline and real-time share the same storage: the engine must be unified and the storage must be unified, although the cost is relatively high. The last piece is the feature logic itself: the logic written during feature mining should be applied directly to production, with applications generated from a single unified definition, so that it stays consistent with the logic used during feature development.

Q15: If real-time features cannot be obtained or return slowly, what should the online model or decision engine do?

A15: This is a problem we often encounter. For example, models and the decision engine depend on third-party features — what if the third party is unavailable? It depends on how strong the dependency is. If it is a strong dependency, we may wait until the real-time feature is successfully acquired before running the model and strategy. If it is a weak dependency, the situation is considered during model development and handled with other features or other methods. The same applies to the decision engine, where different decision rules can be customized to cover this case.

That's all for today's sharing, thank you all.

