Cigna Life Insurance: Unifying the OLAP Technology Stack with Apache Doris

Introduction:

Big data, artificial intelligence, cloud computing, and other technologies are driving the development of insurance technology and accelerating the digitalization of the insurance industry. Against this background, Cigna has continued to explore how to integrate and extend multiple data sources to give agents more detailed customer leads, and how to embed intelligent analysis throughout the entire business chain to achieve comprehensive insight into users, products, and scenario strategies, with closed-loop iteration. This article describes in detail Cigna's journey in big data infrastructure, from the initial OLAP engines serving online reports and ad-hoc analysis to a unified real-time data warehouse built on Apache Doris. With a single architecture, Cigna achieves real-time analysis and integrated, unified management of data across business domains, and ultimately reduces costs and increases revenue for the front-line insurance business.

Author: Cigna Big Data Platform R&D Team


China Merchants Cigna Life Insurance Co., Ltd. is a Sino-foreign joint venture life insurance company founded by China Merchants Bank and Cigna Group. It provides companies and individuals with products and services covering insurance protection, health management, wealth planning, and more. To date, Cigna has served more than 10 million customers and settled claims for more than 1 million of them. With its one-stop, convenient health management services and flexibly configured, customized insurance plans, it has won the continued choice and trust of its users.

Faced with explosive growth in global data volume, the timeliness and accuracy of data are increasingly important for refined enterprise operations. We hope to use data to quickly perceive customer behavior, locate customer problems, and efficiently match users with the products and services they need, so as to achieve goals such as refined marketing and broadening the insurable boundary.

As the business expands and analysis scenarios grow more diverse, the requirements of business analysts have become more complex. The data warehouse must not only develop data reports quickly, but also unify the analysis and management of streaming and batch workloads, lakes and warehouses, and diverse data types. In building big data infrastructure, these integration and unification capabilities are crucial. In this context, our data warehouse architecture has been continuously upgraded: from the first-generation architecture that supported only BI reports and data dashboards, to the second-generation architecture that used multiple systems and components to provide data services, to today's new generation of unified real-time data warehouse, which uses a single Apache Doris deployment to simplify the architecture, unify the technology stack, and unify data management and analysis. This not only improves data processing efficiency but also meets more diverse data analysis needs.

This article describes in detail how, during the iterative upgrade of the data warehouse architecture, Cigna unified storage, computation, and query serving on Apache Doris; how it met write-latency requirements; and how it achieved extremely fast query performance in scenarios such as high-concurrency point queries and multi-table joins. The new warehouse supports efficient writing and querying of sales leads, high-frequency updates of customer retention information, and consistent access to service-scenario data, further converting customer leads into private-domain business opportunities and giving the enterprise versatile capabilities in operations, services, marketing, and more.

Architecture 1.0: Multi-component quasi-real-time data warehouse

The initial business requirement was for the data warehouse to host three types of business scenarios: self-service policy queries for C-end users, multi-dimensional analysis reports for business analysts, and real-time data dashboards for managers. The data warehouse needed to provide unified storage of business data and efficient query capabilities to support business analysis and decision-making, and also to support data writeback to close the business loop.

  • Self-service policy query: Through the China Merchants Cigna app, users can look up an insurance contract by policy ID, or run customized filtered queries along different dimensions (such as coverage period, insurance category, and claim amount) to view information across the policy life cycle.
  • Multi-dimensional report analysis: Based on business needs, business analysts develop detailed-data and indicator-dimension reports to gain insight into policy product innovation, pricing, claims fraud prevention, and more, and adjust business strategy accordingly.
  • Dashboards: mainly real-time data screens for a particular bank channel or branch. By aggregating indicators and other data, the screens display information such as popular insurance types, daily sales, total premiums, the share of each insurance type, and the trend in premium growth over the years.

In the early stage of the business, the requirements on data services were relatively simple, mainly improving the timeliness of report data. Therefore, when building the data warehouse we adopted a typical Lambda architecture, collecting, computing, and storing data through two links, one real-time and one offline. The warehouse was designed mainly around a wide-table model to support querying and analysis of indicator data and detailed data.
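To make the cost of this dual-link design concrete, here is a minimal, illustrative sketch (all names and data are hypothetical, not from Cigna's system) of the merge a serving layer must perform in a Lambda architecture, where metric totals live partly in the offline batch view and partly in the streaming view:

```python
# Minimal sketch of the serving-layer merge a Lambda architecture forces:
# offline (Hive) batch results must be combined with real-time (Flink)
# increments before an API can answer a query. Names are illustrative.

def merge_lambda_views(batch_view: dict, speed_view: dict) -> dict:
    """Overlay real-time increments on top of the last offline batch snapshot.

    batch_view: metric totals as of the last batch run.
    speed_view: deltas accumulated by the streaming job since that run.
    """
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Daily premium totals: yesterday's batch snapshot plus today's streaming deltas.
batch = {"policy_a": 1200, "policy_b": 300}
speed = {"policy_b": 50, "policy_c": 10}
print(merge_lambda_views(batch, speed))
# {'policy_a': 1200, 'policy_b': 350, 'policy_c': 10}
```

Every external data service has to repeat some variant of this merge, which is exactly the duplicated development cost the later architecture upgrade removes.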

[Figure 1: Architecture 1.0 diagram]

As the architecture diagram shows, Flink CDC handles real-time data collection, while our self-developed Hisen tool suite (built on Sqoop, DataX, and Python) handles offline collection. After the raw data is collected, real-time data is computed with Flink and offline data is batch-processed by Hive; the results are then imported into different OLAP components (Presto, ClickHouse, HBase, and MySQL), which serve the upper-level business. Each component plays a different role in the architecture:

MySQL

MySQL is mainly used to store indicator data after computation. The warehouse tables now exceed tens of millions of rows, but MySQL has storage limitations at this scale and is prone to problems such as long execution times and query errors.

ClickHouse

ClickHouse performs well on single-table reads but is weak at joining large tables. As business scenarios multiplied and real-time data was continuously appended and updated, ClickHouse showed clear limitations against the new requirements:

  • To reduce repeated computation of indicators, we needed a star schema with multi-table joins and high-concurrency queries, which ClickHouse could not support well;
  • When policy content changes, data must be updated and written in real time. However, ClickHouse lacks support for real-time transactions: when data changes, wide tables must be regenerated to overwrite the old data, which falls short of our data-update timeliness requirements.

HBase

HBase is mainly used for primary-key queries, reading basic user-status data from MySQL and Hive, including customer points, insured time, and accumulated insured amount. Since HBase does not support secondary indexes, reads on non-primary-key columns are limited and cannot serve related query scenarios. HBase also does not support SQL queries.

Presto

Given the query limitations of the components above, we also introduced Presto as a query engine for offline data, running interactive analysis against data in Hive and providing reporting services to the upstream side.

After Data Warehouse 1.0 went live, it was used across more than 10 branches, and a large number of data screens and BI reports were developed on it. As the business scope expanded, scenarios such as marketing, operations, and customer service placed higher demands on data writing and query performance, and the 1.0 architecture, which used four components to provide data services, ran into real challenges in practice. To avoid rising operations and maintenance costs and a growing learning burden on R&D staff caused by too many components, and to guarantee consistency of multi-source data across the offline and real-time links, we decided to start an iterative architecture upgrade.

Component requirements and system selection

To meet business needs, we wanted to slim down the architecture and shorten the data processing pipeline as much as possible. With too many components and redundant links, the 1.0 architecture inevitably hurt the performance and timeliness of data storage and analysis. We therefore looked for a single OLAP system that could cover most business scenarios, reduce the development, operations, and usage costs caused by a complex technology stack, and maximize overall performance. The specific requirements were:

  • Import performance: real-time writing and real-time updating, with support for high-throughput writes of massive data.
  • Query performance: query services over both dimensional data and transaction data, with high-performance, real-time queries on massive data.
  • Flexible multi-dimensional analysis and self-service queries: primary-key indexes for point and range queries, plus multi-dimensional retrieval and analysis, join queries over billions of rows, and flexible, dynamic drill-down and roll-up for data analysis.
  • A simpler data platform architecture: one component with strong all-round capabilities to replace the current redundant architecture, covering real-time and offline reads and writes, high query performance across different scenarios, and simple, easy-to-use SQL queries.

On this basis we began system selection, comparing popular components on the market with our existing architecture across many dimensions and evaluating whether they met the business requirements. In the end, among the many OLAP systems, we settled on Apache Doris, for the following reasons:

  • Low-latency real-time writes: supports high-throughput writes from Flink CDC over massive data, serving real-time data externally; supports merge-on-write on the Unique Key model for high-frequency micro-batch real-time writes; supports Upsert and Insert Overwrite for efficient data updates.
  • Consistent, ordered data: the Label mechanism and transactional imports guarantee Exactly-Once semantics during writes; the Sequence column on the Unique Key model guarantees ordering during data import.
  • Excellent query performance: Doris supports Rollup pre-aggregation and materialized views for query acceleration; vectorized execution reduces virtual function calls and cache misses; inverted indexes accelerate full-text search and equality/range queries on text, numeric, and date columns.
  • High-concurrency point queries: partitioning and bucketing (partitioning by time and setting the number of buckets) filter out irrelevant data, reducing the data scanned at the storage layer and locating results quickly; in addition, Doris 2.0 adds a series of point-query optimizations such as a row-based storage format, short-circuit query plans, and prepared statements, further improving point-query efficiency and reducing SQL parsing overhead.
  • Multiple data models: the star schema meets the need for join queries across billion-row tables; flat wide tables with aggregation deliver extremely fast single-table queries and multi-dimensional analysis.
  • Simple architecture, easy operations, easy scaling, high availability: Doris FE nodes manage metadata and replicas, while BE nodes handle data storage and task execution, making the architecture simple to deploy, configure, and operate. Doris supports one-click node addition and removal, automatic replica repair, and load balancing across nodes, making it easy to scale; and when a single node fails, the cluster keeps running stably, meeting our requirements for high service availability and data reliability.
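The Sequence column mentioned above is what keeps out-of-order writes correct. The following is an illustrative Python simulation of that semantics, not Doris code: on a Unique Key table with a sequence column, the row with the highest sequence value wins per key, so late-arriving stale updates are discarded rather than overwriting newer data.

```python
# Illustrative simulation (not Doris internals) of Unique Key upserts with a
# sequence column: for each primary key, the row with the highest sequence
# value survives, so a late-arriving stale update cannot clobber newer data.

def apply_upserts(rows, seq_col="updated_at"):
    """Return the surviving row per primary key, sequence-column style."""
    table = {}
    for row in rows:
        key = row["policy_id"]
        current = table.get(key)
        # Keep the incoming row only if its sequence value is not older.
        if current is None or row[seq_col] >= current[seq_col]:
            table[key] = row
    return table

writes = [
    {"policy_id": "P1", "status": "active",  "updated_at": 100},
    {"policy_id": "P1", "status": "lapsed",  "updated_at": 300},
    {"policy_id": "P1", "status": "pending", "updated_at": 200},  # late, stale
]
final = apply_upserts(writes)
print(final["P1"]["status"])  # lapsed
```

Even though the "pending" update arrives last, its sequence value (200) is below the already-applied 300, so the policy correctly remains "lapsed".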

[Figure 2: OLAP system comparison chart]

The comparison chart also shows that, in both real-time and offline scenarios, Apache Doris has the most balanced and capable overall profile: it supports self-service queries, real-time and offline OLAP analysis, high-concurrency point queries, table joins, and other query scenarios, and it performs well on write performance, high availability, and ease of use. It is a single component that can serve multiple business scenarios.

Architecture 2.0: Unified technology stack based on Apache Doris

[Figure 3: Architecture 2.0 diagram]

The two generations of the data warehouse architecture differ mainly in storage, computation, and query analysis. Version 1.0 relied on multiple components to jointly build the OLAP analysis engine; as the business expanded, problems such as storage redundancy, data latency, and excessive maintenance costs gradually emerged. Architecture 2.0 is an upgrade built on Apache Doris: it replaces the four components Presto, MySQL, HBase, and ClickHouse, migrates their data into Apache Doris, and provides unified external query services.

The new architecture not only unifies the technology stack but also cuts development, storage, and operations costs, and further unifies business and data. A single Apache Doris system can handle both online and offline workloads, achieving unified data storage; it can serve data analysis across different scenarios, supporting both high-throughput interactive analysis and high-concurrency point queries, achieving unified business analysis.

01 Accelerate data analysis efficiency

With Doris' extremely fast analytical performance, high-concurrency point-query scenarios for C-end users reach thousands to tens of thousands of QPS, and queries over hundreds of millions or billions of rows respond in milliseconds. Using Doris' rich data-import methods and efficient writing, we achieve second-level write latency, and merge-on-write on the Unique Key model further accelerates queries during concurrent reads and writes. In addition, we use Doris' hot-cold tiering to place massive historical cold data on cheap storage media, reducing the storage cost of historical data and improving query efficiency on hot data.
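A large part of that point-query speed comes from the partition-plus-bucket layout described earlier. Here is an illustrative Python sketch (not Doris internals; names and values are hypothetical) of why a point query touches so little data: a date predicate prunes whole partitions, and hashing the key then selects a single bucket (tablet) to scan.

```python
# Illustrative sketch of partition pruning plus bucket selection for a point
# query: prune by the partition column first, then hash the key to pick one
# bucket, so only a tiny fraction of the table is ever scanned.

import zlib
from datetime import date

def locate_tablet(partitions, policy_id, query_date, num_buckets=16):
    """Return (partition, bucket) for a point query, or None if pruned away."""
    if query_date not in partitions:                        # partition pruning
        return None
    bucket = zlib.crc32(policy_id.encode()) % num_buckets   # bucket selection
    return (query_date, bucket)

partitions = {date(2023, 9, 1), date(2023, 9, 2)}
hit = locate_tablet(partitions, "P42", date(2023, 9, 2))
miss = locate_tablet(partitions, "P42", date(2023, 1, 1))
print(hit, miss)
```

A query outside the retained date range is answered without reading any data at all, while an in-range point query reads a single bucket instead of the whole partition.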

02 Reduce various costs and expenditures

Compared with the original architecture, the new one halves the number of core components, greatly simplifying the platform and reducing operations and maintenance costs. Apache Doris also removes the need to split storage and query services across different components, unifying real-time and offline workloads and cutting storage costs; and since the data service API no longer has to merge real-time and offline data when serving external requests, development costs for API integration are roughly halved.

03 Ensure high availability of data services

Because Doris unifies storage, computation, and serving in one data warehouse architecture, the platform's overall disaster-recovery plan is easy to implement, with no worries about data loss or duplication across multiple components. More importantly, Doris' built-in cross-cluster replication (CCR) can synchronize database tables between clusters within seconds to minutes. If a system crash interrupts the business or causes data loss, we can quickly restore from the backup cluster.

Two mechanisms underpin Doris' CCR function and meet our requirements for service availability, ensuring highly available data services:

  • Binlog mechanism: when data changes, this mechanism automatically records the modified data and operations, assigning each operation an increasing LogID, which makes the data traceable and ordered.
  • Persistence mechanism: after a system crash or other emergency, this mechanism ensures the data has been persisted to disk, guaranteeing data reliability and consistency.
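The increasing-LogID property is what makes replication both ordered and safe to retry. The sketch below is an illustrative Python simulation of that idea, not Doris CCR code: a replica applies binlog entries in LogID order and skips anything at or below its last applied ID, so duplicate deliveries after a crash are harmless.

```python
# Illustrative simulation (not Doris CCR itself) of binlog replay on a replica:
# entries carry increasing LogIDs, so the replica applies them in order and
# skips anything already applied, making re-delivery after a crash idempotent.

def replay_binlog(replica_state, last_applied_id, binlog):
    """Apply binlog entries with LogID > last_applied_id, in LogID order."""
    for entry in sorted(binlog, key=lambda e: e["log_id"]):
        if entry["log_id"] <= last_applied_id:
            continue  # already applied; safe to receive again after a crash
        replica_state[entry["key"]] = entry["value"]
        last_applied_id = entry["log_id"]
    return replica_state, last_applied_id

state = {"P1": "active"}
binlog = [
    {"log_id": 2, "key": "P2", "value": "pending"},
    {"log_id": 1, "key": "P1", "value": "active"},   # duplicate delivery
    {"log_id": 3, "key": "P1", "value": "lapsed"},
]
state, last = replay_binlog(state, last_applied_id=1, binlog=binlog)
print(state, last)  # {'P1': 'lapsed', 'P2': 'pending'} 3
```

The duplicate entry with LogID 1 is silently skipped, while entries 2 and 3 are applied in order, leaving the replica consistent with the source.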

Front-line insurance business gains and practice

The real-time data warehouse built on the unified Apache Doris technology stack went live in production in Q3 2022. It supports efficient OLAP analysis over massive data and an increasing number of business scenarios on the platform. On the business-operations side, the pool of sales leads keeps expanding and has now reached the hundred-million level. As more Apache Doris capabilities are adopted, the front-line business revenue supported by the data warehouse also keeps growing.

  • Efficient tracking of sales leads: we have launched 30+ new scenario applications for sales and performance tracking. Based on sales leads, business staff can accurately and quickly obtain customers' insurance inquiries and evaluations on the official website, the app, the mall, official accounts, mini programs, and other channels, along with live-stream participation data, WeChat Work activity participation data, complimentary insurance offers, and more. Apache Doris' multi-dimensional analysis converts this data into leads, ultimately enabling precise customer outreach, effective capture of customer intent, and timely follow-up on orders.
  • High-frequency updates of customer retention information: 20+ new scenario applications have been launched for new-customer conversion and existing-customer care. These scenarios depend on the data platform's ability to update customer retention information at high frequency. Using Apache Doris for regular analysis of existing-customer data, we can effectively identify customers' insurance needs at different stages, discover protection gaps among existing customers, broaden their insurable boundaries, and further increase operating income.
  • Consistent access to business scenario data: in customer service, we focus on giving customers a consistent experience and fast responses. We have launched 20+ new scenario applications around service experience to avoid information asymmetry and data inconsistency, ensuring that data from every sales link remains consistent and unified across underwriting, claims, customer service consultation, the member center, and other processes.

Future plans

The introduction of Apache Doris has played a crucial role in simplifying the real-time data warehouse architecture and improving performance. We have now replaced Presto, ClickHouse, MySQL, and HBase with Apache Doris, unifying the OLAP technology stack, reducing costs of all kinds, and improving import and query performance.

We also plan to trial Doris in the batch layer, unifying offline batch processing inside Doris to solve the Lambda architecture's problems of duplicated cost and inconsistency between the real-time and offline links, truly unifying computation, storage, and analysis in one architecture. At the same time, we will continue to exploit Doris' unification advantages and use Multi-Catalog to let data flow freely between the lake and the warehouse, delivering seamless, extremely fast analysis over data lakes and heterogeneous storage, and building a more complete, more open, unified big data technology ecosystem.

We are very grateful to the SelectDB team for their continued technical support. Cigna's data warehouse is no longer limited to simple reporting scenarios: a single architecture supports data analysis across many different scenarios, meets the unified writing and querying of real-time and offline data, and delivers data value to product marketing, customer operations, and C-end and B-end businesses, allowing insurance staff to obtain data more efficiently, predict customer needs more accurately, and win opportunities for the enterprise.

In the future, we will continue to participate in the construction of the Apache Doris community and contribute the insurance industry’s real-time data warehouse construction experience and practical applications. We hope that Apache Doris will continue to grow and develop and contribute to the construction of basic software!


Origin blog.csdn.net/SelectDB_Fly/article/details/133019959