Hebei Happy Consumer Finance built a real-time data warehouse on Apache Doris and boosted query speed by nearly 400x!

Introduction:

As its customer base and loan volume keep growing, Hebei Happy Consumer Finance urgently needs big data and analytics technologies to provide better decision support, improve work efficiency, and enhance the user experience. To that end, the company decided to build a data middle platform, evolving from an offline data warehouse based on TDH to a real-time data warehouse based on Apache Doris. This unified the data outlet, improved data quality, and delivered a nearly 400x increase in query speed. This article shares the experience and application practice of Hebei Happy Consumer Finance's data platform team in detail, in the hope of offering useful references to other companies.

Author: Information Technology Department, Hebei Happy Consumer Finance

Hebei Happy Consumer Finance Co., Ltd. was founded by Bank of Zhangjiakou. Officially opened in June 2017, it is the 22nd consumer finance company in China and the first in Hebei Province. It mainly provides individual customers with inclusive, small-amount, unsecured consumer loans capped at 200,000 yuan. The company's services now cover 32 provincial-level regions across the country, and it has successively been recognized as a National High-Tech Enterprise by the Ministry of Science and Technology, a Technology-Based SME of Hebei Province, and a Technology "Little Giant" Enterprise of Shijiazhuang City.

As the number of customers and the loan volume continue to grow, how to use big data and analytics to provide better decision support for every business line, improve work efficiency, and deliver a better user experience has become an urgent problem. The specific requirements are as follows:

  • Executive dashboard: Build a management cockpit that helps senior executives quickly grasp the company's overall operating status. The cockpit integrates data from all business lines, covering both real-time and offline business indicators. In this scenario we expect query results to return within milliseconds so that management can make decisions efficiently.
  • Real-time variables: To support real-time risk control decisions, queries across the entire product line must return within 500 ms, and high-dimensional derived variables at the product, customer, and IOU level must be computed from basic information such as applications, credit lines, related parties, overdue records, and repayments.
  • Decision analysis: To support business analysis and decision-making for the various business departments, multi-dimensional thematic reports by year, quarter, and month must return at second-level latency. In particular, the risk department needs to review operations over the full life cycle starting from loan origination, while the finance department needs to forecast future profits from historical operating data, involving large data volumes and complex logic.
  • Risk modeling: Provide full detailed data and batch-computed variables for risk modeling, supporting core business functions such as customer scoring and rating, approval, and credit granting, as well as needs like business indicator observation, input-output analysis, variable screening, and decision-making.
  • Regulatory compliance: To ensure the business complies with relevant laws, regulations, and industry standards, the compliance subsystem must report regularly according to regulatory requirements; the reported data is divided into summary indicator data and detailed data.

To meet the data analysis needs of these different business lines, the company began to build and continuously optimize a data middle platform. Initially, the company built an offline data warehouse on the commercial product TDH to cover basic analysis needs. As requirements for data freshness and real-time analysis grew, however, the company urgently needed a real-time data warehouse, so Apache Doris was introduced and a real-time warehouse was built on top of it, finally forming an efficient and stable data platform. The rest of this article shares this journey in detail.

Offline data warehouse based on TDH

Since the early requirements were mainly offline analysis, we prioritized building an offline data warehouse on a TDH cluster. Upstream data is collected into the offline warehouse through Sqoop and DataX, and after standardized data cleaning, the warehouse's daily batch runs are completed. The offline data warehouse architecture is shown below:

Hebei 1.png

As data accumulated and business users' freshness requirements rose, several problems with the TDH-based offline platform gradually surfaced:

  • Slow data synchronization: the offline warehouse's extraction relies on tools such as Sqoop and DataX, which are bound to a scheduling cycle, so the collected data inevitably lags behind the source.
  • Heavy resource contention: the offline warehouse's daily batch runs span a long window, often from early morning until around 5:00 p.m., causing contention between batch jobs and ad hoc queries and hurting the experience of business users.
  • Slow query analysis: custom statistical analysis and data exploration on the offline platform respond slowly, making timeliness hard to guarantee and seriously affecting work efficiency.
  • T+1 latency: business lines increasingly need real-time data processing, and T+1 data can no longer satisfy the demand for fast data access and business value creation.
  • Long report customization cycle: report development has an inherent iteration cycle, making flexible, diverse analysis and exploration difficult for business users.
  • Chimney effect: for regular regulatory reporting, the data platform must pull data from multiple business systems; whenever a business system changes, all downstream reporting subsystems are affected, forming data chimneys.

Technology selection

To solve these problems, we urgently needed an MPP engine to build a real-time data warehouse. We had several basic requirements for the new engine: it should be easy to use, so the team can master it quickly; it should have strong data import capabilities, to ingest massive data quickly and efficiently; and it should be compatible with the tools around our offline warehouse, to integrate seamlessly with the existing data processing stack.

Driven by these selection criteria, we systematically evaluated the two popular options, ClickHouse and Apache Doris. Doris fit our requirements better, for the following reasons:

  • Low deployment cost: Doris has a distributed architecture with only two process types to deploy and no dependency on other systems; it supports online cluster scaling and automatic replica repair, keeping deployment and operating costs low.
  • Quick to learn: Doris follows the mainstream partitioning-and-bucketing design, and its index structure is conceptually similar to MySQL's, so users need little new knowledge to get started. ClickHouse, in contrast, requires specifying engine types when creating databases and tables, which makes it more cumbersome and harder to pick up.
  • Tool compatibility: business users query the offline warehouse through TDH's client tool WaterDrop. Doris connects to WaterDrop seamlessly via the standard protocol, while ClickHouse is not compatible.
  • Rich data ecosystem: Doris integrates tightly with Flink, Kafka, and other components, supports federated queries, and offers a wealth of data import and access methods to cover many processing scenarios.
  • High concurrency: our stress tests showed that Doris remains performant and stable under high concurrency and large data volumes, meeting the needs of different business scenarios.
  • Active community: the Doris community is very active, with many developers and users contributing technical support and solutions, plus comprehensive documentation. In addition, SelectDB funds a full-time professional technical team to serve community users, so any problem gets a quick response.

Real-time data warehouse based on Doris

On top of the offline data warehouse, we built the real-time data warehouse with Doris, a CDH cluster, and an Airflow cluster. The real-time warehouse's data sources are mainly the offline warehouse and MySQL. Flink CDC, driven through PyFlink (Flink's Python API), collects MySQL data in real time into the core compute engine Doris (detailed later). On top sits the Airflow distributed scheduler, which schedules and runs the real-time tasks. We layered the Doris-based warehouse so that data flows through each layer and serves every scenario in a unified way.

Hebei 2.png
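As a sketch of the collection path described above, the following PyFlink snippet registers a MySQL CDC source and a Doris sink with Flink SQL and streams changes between them. Column names, hosts, and credentials here are illustrative assumptions, not our production configuration:

```python
# Minimal PyFlink sketch of MySQL CDC -> Doris ingestion.
# All identifiers (hosts, users, columns, table names) are placeholders.

def mysql_cdc_source_ddl(host: str, db: str, table: str) -> str:
    """Flink SQL DDL for a MySQL CDC source table (flink-connector-mysql-cdc)."""
    return f"""
    CREATE TABLE mysql_src (
        id BIGINT,
        apply_no STRING,
        apply_time TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = '{host}',
        'port' = '3306',
        'username' = 'flink',
        'password' = '******',
        'database-name' = '{db}',
        'table-name' = '{table}'
    )"""

def doris_sink_ddl(fenodes: str, table_identifier: str) -> str:
    """Flink SQL DDL for a Doris sink table (flink-doris-connector)."""
    return f"""
    CREATE TABLE doris_sink (
        id BIGINT,
        apply_no STRING,
        apply_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'doris',
        'fenodes' = '{fenodes}',
        'table.identifier' = '{table_identifier}',
        'username' = 'root',
        'password' = '******',
        'sink.label-prefix' = 'cdc_sync'
    )"""

def run_sync_job():
    # Requires a Flink runtime; this runs on the cluster, not locally.
    from pyflink.table import EnvironmentSettings, TableEnvironment
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    t_env.execute_sql(mysql_cdc_source_ddl("mysql-host", "loan_db", "apply_info"))
    t_env.execute_sql(doris_sink_ddl("doris-fe:8030", "ods.apply_info"))
    t_env.execute_sql(
        "INSERT INTO doris_sink SELECT id, apply_no, apply_time FROM mysql_src")
```

The DDL-building functions are kept separate from job submission so they can be generated per product database by the automated onboarding logic.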

Thanks to the rich import methods Doris provides, we quickly moved the cleaned real-time data from the offline warehouse into the Doris cluster, achieving fast data migration. We have since moved all TDH-based query analysis and data exploration workloads onto the Doris engine. With Doris's fast computation and excellent query performance, data processing and analysis run far more efficiently, and business processing speed has improved markedly.

Take one SQL query used in credit approval scenarios as an example. We compared query latency between the old and new architectures across three large tables of roughly 100 thousand, 10 million, and 100 million rows. On the old TDH architecture the query took 11 minutes 30 seconds to return; on the new Doris-based architecture it takes only 1.7 seconds, nearly 400 times faster!

Data scale:

Hebei 3.png

Original offline data warehouse: 11 minutes 30 seconds to return the result

Hebei 4.png

New Doris-based data warehouse: after query optimization, only 1.7 seconds to return the result, sometimes even under 1 second.

Hebei 5.png

Application practice

Real-time data collection

The company's business systems are typically split into one database per product, with identical table structures across products. A core function of the real-time warehouse is to use Doris's rich import capabilities to collect the same logical table from these scattered databases into one logical table in Doris. The collected data can then be adjusted uniformly at the regulatory-topic level, avoiding the chimney effect. Once real-time data lands in the warehouse, it automatically triggers the computation of derived variables and updates their values. The aggregated derived-variable values live in a separate table, so queries against them return at millisecond latency.

To build the real-time collection pipeline, we first pinned the versions of core components such as Flink CDC, Flink, Flink on YARN, and Apache Doris, then built product-based automatic onboarding into the real-time warehouse on top of PyFlink. Specifically:

  • At the data level, the business system databases are sharded horizontally and vertically to improve read/write performance and availability.
  • At the warehouse level, we aggregate the business tables' data by dimension for unified summary analysis.
  • For data access, we must efficiently onboard the stock data of existing business systems and continuously, stably ingest incremental data.

In addition, we also provide standardized access solutions and interfaces to meet the needs of different business scenarios.
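To illustrate the "many product databases, one logical table" idea, here is a minimal sketch (table and column names are hypothetical) of the routing step that tags each row with its source shard before it lands in the unified Doris table:

```python
# Each product line has its own database with identical table structures.
# When collecting into Doris, rows from product_a.apply_info,
# product_b.apply_info, ... all land in one logical table, tagged with their
# source database and table so the origin stays traceable.

def route(database_name: str, table_name: str, row: dict) -> tuple[str, dict]:
    """Map a sharded source table to its unified Doris target table,
    attaching source metadata columns."""
    target = f"ods.{table_name}"  # one logical Doris table per business table
    enriched = dict(row)
    enriched["database_name"] = database_name  # which product shard it came from
    enriched["table_name"] = table_name
    return target, enriched

target, row = route("product_a", "apply_info", {"id": 1, "apply_no": "A001"})
```

Carrying `database_name` and `table_name` as ordinary columns is also what makes them natural bucketing keys later in the approval-table design.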

Usage steps:

  1. Access configuration table: configure the metadata of the business database tables to be collected

Hebei 6.png

  2. Scheduling system deployment: deploy the real-time collection tasks through the scheduling system

Hebei 7.png

  3. Routine task operation and maintenance: we have highly encapsulated task launch, start, stop, and exception recovery, and deeply integrated them with the distributed scheduler Airflow. Users do not need to care about the underlying details and can migrate MySQL tables to Doris with one click, with both stock and incremental data migrated automatically. After discussions with us, the community released Doris-Flink-Connector 1.4.0, which integrates Flink CDC and enables one-click whole-database synchronization from relational databases such as MySQL to Apache Doris.
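For reference, the whole-database synchronization in Doris-Flink-Connector 1.4.0 is launched as a Flink job. A hedged sketch of the command follows; the jar name, hosts, and credentials are placeholders, and the exact option set should be checked against the connector documentation for your version:

```shell
flink run \
    -c org.apache.doris.flink.tools.cdc.CdcTools \
    flink-doris-connector-1.16-1.4.0.jar \
    mysql-sync-database \
    --database ods \
    --mysql-conf hostname=mysql-host \
    --mysql-conf username=flink \
    --mysql-conf password=****** \
    --mysql-conf database-name=loan_db \
    --including-tables ".*" \
    --sink-conf fenodes=doris-fe:8030 \
    --sink-conf username=root \
    --sink-conf password=****** \
    --sink-conf jdbc-url=jdbc:mysql://doris-fe:9030 \
    --table-conf replication_num=3
```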

Data Quality Monitoring

The offline warehouse suffered from various data quality issues, usually exposed only during batch runs, which sharply compressed the window for repairing data. To address this, we built a data quality monitoring system on Doris and migrated the offline warehouse's quality monitoring models to it. The system monitors business indicators and data quality in real time, enabling timely manual intervention or alerts when problems appear, which improves the stability and efficiency of the offline batch runs. Furthermore, once the real-time warehouse ingests data, the system's validation rules check its quality in real time, ensuring collection accuracy.

Hebei 8.png

So far we have migrated 30% of the data monitoring indicators and 35 business indicators to the real-time Doris cluster, heading off problems more than 3 times a month and effectively improving the data quality of the offline batch runs. Going forward, we will keep migrating more indicators to Doris to further improve data processing efficiency and quality.
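The rule side of such a monitoring system can be sketched as follows; rule names, SQL, and thresholds here are illustrative, and in production the metric value would come from executing the rule's SQL against Doris over the MySQL protocol:

```python
# Minimal sketch of data quality rule evaluation: each rule carries a metric
# query and a threshold; a failing rule would trigger an alert before the
# offline batch run starts. All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class QualityRule:
    name: str
    sql: str            # metric query, e.g. a null-rate or row-count check
    threshold: float
    op: str = "<="      # rule passes if metric <op> threshold holds

    def evaluate(self, metric: float) -> bool:
        if self.op == "<=":
            return metric <= self.threshold
        if self.op == ">=":
            return metric >= self.threshold
        raise ValueError(f"unsupported operator: {self.op}")

rule = QualityRule(
    name="apply_info_null_rate",
    sql="SELECT SUM(id_number IS NULL) / COUNT(*) FROM ods.apply_info",
    threshold=0.01,
)
assert rule.evaluate(0.003) is True   # within threshold: pass
assert rule.evaluate(0.05) is False   # too many nulls: alert
```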

Federated data queries

Each business line's core data lives in different types of databases, such as MySQL, Hive, and ES. The Multi-Catalog feature introduced in Apache Doris 1.2 unifies the data query outlet and enables federated queries, which greatly simplifies data analysis. Meanwhile, with Doris's persistence capability, data from other sources can be quickly synchronized through external tables. Backed by Doris's aggregate queries, vectorized execution engine, and other technologies, we have achieved a truly efficient, unified data portal and improved the efficiency of data analysis.
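As an illustration of Multi-Catalog usage (catalog names and connection details are made up), an external Hive catalog can be registered once and then joined with local Doris tables in a single federated query:

```sql
-- Register an external Hive catalog (metastore address is a placeholder).
CREATE CATALOG hive_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://hive-metastore:9083'
);

-- Federated query: join a local Doris table with an external Hive table.
SELECT d.apply_no, h.channel_name
FROM internal.ods.apply_info d
JOIN hive_catalog.dw.channel_dim h ON d.apply_source = h.channel_id;
```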

Optimization experience

Load balancing

As the business volume carried by Doris keeps growing, so does the load on the FE nodes. To achieve high availability, we increased the number of FE nodes and put a load balancing layer in front of them. We chose to build FE load balancing on an Nginx TCP reverse proxy, which effectively balances traffic across the FE roles. The configuration is as follows:

Hebei 9.png
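A minimal sketch of the idea, with placeholder hostnames: clients connect to Nginx on port 6030 via the MySQL protocol and are balanced across the FE nodes' default 9030 query ports.

```nginx
# Nginx TCP (stream) reverse proxy in front of multiple Doris FE nodes.
# Hostnames and the listen port are placeholders.
stream {
    upstream doris_fe {
        server fe-node-1:9030 weight=1;
        server fe-node-2:9030 weight=1;
        server fe-node-3:9030 weight=1;
    }
    server {
        listen 6030;
        proxy_pass doris_fe;
        proxy_connect_timeout 30s;
    }
}
```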

Query optimization

The approval system's business data is persisted in MySQL, with a cumulative intake of nearly 280 million records. To cope with the ever-growing data volume, the business database uses table splitting and data archiving. On the business side, however, we still need to query the full data set, with a latency requirement of 3 seconds. The query requirements fall into the following categories:

  • Query by one or more of "application number", "customer number", "ID card number", and "core customer number"
  • Query by one of "application date" or "update date", combined with fields such as "name", "application type", "incoming channel", "whitelist channel", "decision stage", "approval type", and "approval result" to form review conditions
  • Query the detailed approval data of the past week by either "application date" or "update date"

To meet these query scenarios, we designed the approval intake tables around Doris's partitioning and bucketing, Bloom filters, and bitmap indexes, ultimately meeting the 3-second query latency requirement across the business.

Optimization strategy:

Partitioning: apply_time

Bucketing: ID, database_name, table_name

Bloom filter index: id_number, bhb_customer_id, customer_name, customer_id, serial_no

Bitmap index: apply_source, white_channel, approval_result, approval_status, product_type, decision_stage
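An illustrative DDL applying this strategy (column types, the exact column list, and the bucket count are assumptions rather than the production schema) might look like:

```sql
-- Range-partition by apply_time, hash-bucket by the shard keys, Bloom filter
-- the point-lookup columns, bitmap-index the low-cardinality filter columns.
CREATE TABLE ods.apply_info (
    id BIGINT,
    database_name VARCHAR(64),
    table_name VARCHAR(64),
    apply_time DATETIME,
    id_number VARCHAR(32),
    customer_name VARCHAR(64),
    apply_source VARCHAR(32),
    approval_result VARCHAR(16)
)
DUPLICATE KEY (id, database_name, table_name)
PARTITION BY RANGE (apply_time) ()
DISTRIBUTED BY HASH (id, database_name, table_name) BUCKETS 16
PROPERTIES (
    'bloom_filter_columns' = 'id_number,customer_name',
    'dynamic_partition.enable' = 'true',
    'dynamic_partition.time_unit' = 'DAY',
    'dynamic_partition.end' = '3',
    'dynamic_partition.prefix' = 'p'
);

CREATE INDEX idx_apply_source ON ods.apply_info (apply_source) USING BITMAP;
CREATE INDEX idx_approval_result ON ods.apply_info (approval_result) USING BITMAP;
```

Bucketing on the source shard columns spreads each product database's rows evenly across tablets, while the Bloom filter serves the point-lookup queries and the bitmap indexes serve the low-cardinality review filters.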

The stress test results for these queries are as follows:

Hebei 10.png

Operation and maintenance management

Hebei 11.png

Through the Prometheus and Grafana integration Doris provides, we can quickly see the overall health of the Doris cluster and each role's metrics. We have also done secondary development to integrate this monitoring platform with the company's unified alerting platform, which fetches the raw Prometheus metrics via API, compares them against thresholds, and triggers alerts of different severities or automatic service restarts. In addition, we automated operations at the FE and BE service level so that a service is pulled back up automatically when it fails, guaranteeing the availability of core services.
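As one hedged example of such service-level self-healing (paths and the service user are placeholders; our actual setup may differ), a systemd unit can restart the FE automatically on abnormal exit, with an analogous unit for the BE:

```ini
# /etc/systemd/system/doris-fe.service -- illustrative auto-restart unit.
[Unit]
Description=Apache Doris FE
After=network.target

[Service]
Type=forking
User=doris
ExecStart=/opt/doris/fe/bin/start_fe.sh --daemon
ExecStop=/opt/doris/fe/bin/stop_fe.sh
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```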

Summary of benefits

Doris is now widely used within the company, with dozens of clusters built so far, bringing the following benefits:

  • Better data freshness: data processing moved from T+1 to real time, solving the offline data latency problem.
  • Second-level query response: using Doris's partitioning and bucketing, materialized views, Bloom filter indexes, and other features for query optimization, ad hoc queries dropped from around 20 minutes to minutes or even seconds, nearly 400 times faster than before.
  • Unified query outlet: relying on Doris's powerful import capabilities and Multi-Catalog feature, data from the various business databases has been consolidated into Doris, which now provides unified query and analysis services and greatly improves query response efficiency.
  • Better data quality: a data quality monitoring system was built on Doris; so far 30% of data monitoring indicators and 35 business indicators have been migrated to the real-time Doris cluster, effectively improving the data quality of the offline batch runs.

In summary, the broad adoption of Doris has brought us multiple benefits: higher data analysis efficiency, lower data management costs, and a unified, real-time, efficient data platform that injects new momentum into the business.

Going forward, we will extend Doris to business areas with even higher demands on real-time performance and timeliness, and we will try more of Doris's features and new capabilities to deepen its use within the company.



Origin my.oschina.net/u/5735652/blog/10089531