[In-depth MaxCompute] HR SaaS provider Renrenjia: using the MaxCompute Transaction Table 2.0 primary key model to deduplicate data and continuously reduce costs and increase efficiency

Introduction: MaxCompute's new Transaction Table 2.0 table type opened for testing on June 27, 2023. MaxCompute supports near-real-time data storage and computing solutions based on Transaction Table 2.0.

Author: Shi Yuyang, Senior Data R&D Engineer at Renrenjia

Business introduction

Renrenjia is an Internet company jointly invested in and established by Alibaba DingTalk and Renrenwo. It helps customers digitalize human resources, relying on product and technology innovation to drive strategy. The company mainly provides human resources SaaS services, including personnel management, salary management, social security management, and value-added services, accelerating empowerment in the human resources field and enabling new ways of working in HR. It currently serves customers across multiple industries, including e-commerce and retail services.

Renrenjia is a typical start-up operating in a highly competitive market. The company has multiple products, and each product's data is independent. Integrating that data to meet internal CRM data needs is a big challenge for the data warehouse team, which must respond with stability, accuracy, and timeliness. The team is required not only to meet internal data needs, but also to optimize computing costs.

Business pain points

While using Alibaba Cloud's big data computing service MaxCompute, we found that as the existing (stock) data grows, the cost of deduplicating incremental data keeps increasing. Our analysis identified the following four reasons.

Incremental data is small in magnitude

Although the company has multiple products, the amount of new and changed user data each day is small (on the MB scale) compared to the full historical data (on the GB scale).

Repeated computation of historical data

For incremental deduplication, every day yesterday's full historical data plus today's new data are fed into a window-based deduplication computation. However, only a very small portion of the full historical data actually needs to be updated, yet all of the historical data has to be pulled into the window computation every time, which is a relatively large computing cost.

Windowing to remove duplicates is computationally expensive

Using the row_number window function to deduplicate and obtain the latest record for each business primary key requires combining yesterday's historical data with today's data. The user table contains hundreds of millions of rows. To save storage costs and support subsequent modeling, this deduplication is performed anyway, even though most of the historical data has not been updated and essentially does not need to be processed again, so this cost is on the high side. The estimated cost of the single daily deduplication SQL statement on the user table is 4.63 yuan (pay-as-you-go).
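For reference, a minimal sketch of the window-based deduplication pattern described above; the table and column names (ods_user_full, ods_user_incr, user_id, user_name, gmt_modified) and the ${bizdate} / ${bizdate_1} scheduling placeholders (today's and yesterday's partitions) are illustrative, not the actual production names:

-- merge yesterday's full data with today's increment, keep the latest row per business key
insert overwrite table ods_user_full partition (ds='${bizdate}')
select user_id, user_name, gmt_modified
from (
    select user_id, user_name, gmt_modified,
           row_number() over (partition by user_id order by gmt_modified desc) as rn
    from (
        select user_id, user_name, gmt_modified from ods_user_full where ds='${bizdate_1}'  -- yesterday's full snapshot
        union all
        select user_id, user_name, gmt_modified from ods_user_incr where ds='${bizdate}'    -- today's increment
    ) src
) ranked
where rn = 1;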

The cost of a daily full extraction is high

If the business database is extracted in full every day, the volume is in the hundreds of millions of rows, even though the actually updated data is small. A full daily pull puts great pressure on the business-side database and seriously affects its performance.

Transaction Table 2.0 data deduplication improvements

MaxCompute's new Transaction Table 2.0 table type opened for testing on June 27, 2023, and MaxCompute supports near-real-time data storage and computing solutions based on it. The human resources data warehouse R&D team started evaluating its features and functions right away and found that its primary key model can be used to deduplicate data and reduce windowing computation costs. The main implementation steps are as follows.

  • Window-deduplicate the daily incremental user basic information;
  • Because the primary key of a primary key table cannot be null, filter out records whose business primary key is empty;
  • Insert the windowed and deduplicated daily incremental data directly into the primary key table; the system automatically performs deduplication based on the business primary key.

Specific improvements in practice

Overall comparison

Table type | Deduplication SQL execution time (s) | Estimated deduplication SQL cost (yuan)
Ordinary (partitioned) table | 151 | 4.63
Transaction Table 2.0 primary key table | 72 | 0.06

Cost and computation time comparison

1. Table creation and insert/update statements

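A minimal, hedged sketch of the primary key table definition, following the documented pattern of declaring a non-null primary key together with the transactional table property; the column set (pk, val, is_deleted) is illustrative, while the table name mf_tt2 and the dd/hh partition columns match the query examples later in this article:

create table if not exists mf_tt2 (
    pk         bigint not null primary key,  -- business primary key, must not be null
    val        string,                       -- illustrative payload column
    is_deleted bigint                        -- optional soft-delete flag, used in a later example
)
partitioned by (dd string, hh string)
tblproperties ('transactional'='true');      -- further properties (bucket number, acid.data.retain.hours, ...) can also be set here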

Insert/update statement

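And a hedged sketch of the daily write, matching the three steps listed earlier (window-deduplicate the increment, drop rows with an empty business key, insert into the primary key table, which then merges rows by primary key automatically); ods_user_incr and its columns are illustrative names:

insert into table mf_tt2 partition (dd='01', hh='01')
select user_id as pk, user_name as val, 0 as is_deleted
from (
    select user_id, user_name,
           row_number() over (partition by user_id order by gmt_modified desc) as rn
    from ods_user_incr
    where ds='${bizdate}'
      and user_id is not null    -- the primary key of a primary key table cannot be null
) t
where rn = 1;                    -- rows whose pk already exists in mf_tt2 are upserted to the latest state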

2. Cost and calculation

Estimated cost of the partitioned-table deduplication job: 4.63 yuan.

Estimated cost of the primary-key-table deduplication job: 0.06 yuan.

The estimated cost cannot be used as the actual billing standard and is for reference only; refer to your bill for the actual cost.
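For reference, such estimates can be produced before running a job with the MaxCompute client's cost estimation command; a hedged example (the statement after cost sql is whatever SQL you want to estimate; verify the command against your client version):

cost sql select * from mf_tt2 where dd='01' and hh='01';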

Partitioned table computation time and resources: the deduplication job ran for 151 s.

Transaction Table 2.0 primary key table computation time and resources: the deduplication job ran for 72 s.

From the comparison above, the daily deduplication SQL cost of the user table dropped from 4.63 yuan to 0.06 yuan and the computation time was cut roughly in half; reduce_num increased significantly, the map-side data volume decreased, and the reduce-side data volume increased significantly.

Merge small files

Transaction Table 2.0 supports near-real-time incremental writes and TimeTravel queries. In scenarios where data is written frequently, a large number of small files is inevitably introduced, so a reasonable and efficient merging strategy is needed to merge small files and deduplicate data. This solves the poor read/write I/O efficiency of many small files and relieves pressure on the storage system, while also avoiding the severe write amplification and conflict failures that frequent compaction can cause.

Currently there are two main methods of data merging:

  • Clustering: only merges the committed DeltaFiles into larger files, without changing the data content. The system runs it periodically based on factors such as the size and number of newly added files; no manual action is required. It mainly addresses the I/O efficiency and stability problems of reading and writing large numbers of small files.


  • Compaction: merges all data files according to a certain strategy and generates a new batch of BaseFiles. Rows with the same primary key keep only their latest state, with no historical states and no system column information, so BaseFiles do not support TimeTravel operations; Compaction mainly serves to improve query efficiency. It can be triggered manually by users according to their business scenario, or triggered periodically by the system through table properties.


To sum up, when incremental data is written into the primary key table, small files are not merged immediately, so a large number of small files is generated. Small files occupy a lot of storage space and slow down queries. Given this, we can manually merge the primary key table's small files after the insert into (as sketched below), configure table properties so that Compaction is triggered automatically based on dimensions such as time frequency and number of commits, or simply wait for the system's Clustering merge. If the data is updated only once a day, we recommend relying on the system's Clustering mechanism.
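A hedged sketch of manually triggering a merge after the daily insert into; the compact statement below follows the transactional-table compaction pattern, but the exact syntax, any required flags, and the table properties for automatic triggering should be checked against the current MaxCompute documentation:

-- manually trigger a major compaction on one partition of the primary key table (syntax is an assumption)
alter table mf_tt2 partition (dd='01', hh='01') compact major;

-- afterwards, the resulting versions can be inspected with the history command shown later in this article
show history for table mf_tt2 partition (dd='01', hh='01');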

Important note:

The file_num and size shown by desc extended table_name include recycle-bin data and currently cannot be displayed precisely. You can either clear the recycle-bin data, or run a Compaction and check the file count (filenum) at the end of the job log.

Data time travel queries and historical data repair

For Transaction Table 2.0 tables, MaxCompute supports querying the source table as of a specified historical time or version, i.e. a historical snapshot query (TimeTravel query), and also supports querying the historical incremental data of a specified time interval or version interval of the source table (Incremental query). To use TimeTravel and Incremental queries, acid.data.retain.hours must be set.
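A hedged example of setting that property; the retention value 48 (hours) is illustrative, and the property can also be declared in TBLPROPERTIES when the table is created:

alter table mf_tt2 set tblproperties ('acid.data.retain.hours'='48');  -- keep 48 hours of history for TimeTravel / Incremental queries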

Data time travel query

1. Based on TimeTravel, query all historical data up to a specified time (for example, a datetime-format string constant); the setting above is required.

select * from mf_tt2 timestamp as of '2023-06-26 09:33:00' where dd='01' and hh='01';

Query historical data and version number

show history for table mf_tt2 partition(dd='01',hh='01');

Query all historical data up to the specified version constant

select * from mf_tt2 version as of 2 where dd='01' and hh='01';

2. Based on Incremental query, query the historical incremental data within a specified time interval (for example, datetime-format string constants). The constant values need to be set according to the times of the actual operations.

select * from mf_tt2 timestamp between '2023-06-26 09:31:40' and '2023-06-26 09:32:00' where dd= '01' and hh='01';

Query the historical incremental data of the specified version interval

select * from mf_tt2 version between 2 and 3 where dd ='01' and hh = '01';
Data repair

Based on a TimeTravel query, insert the full data as of the specified time into a temporary table, clear the data in the current Transaction Table 2.0 primary key table, and then insert the temporary table's data back into the primary key table.
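A hedged sketch of that repair flow; mf_tt2_fix is a hypothetical ordinary staging table, and the sketch assumes that CTAS over a TimeTravel query and DELETE on the primary key table are available in your environment:

-- 1) snapshot the partition as of the chosen time into a staging table
create table if not exists mf_tt2_fix as
select pk, val, is_deleted
from mf_tt2 timestamp as of '2023-06-26 09:33:00'
where dd='01' and hh='01';

-- 2) clear the current data in the primary key table partition
delete from mf_tt2 where dd='01' and hh='01';

-- 3) write the snapshot back, then drop the staging table
insert into table mf_tt2 partition (dd='01', hh='01')
select pk, val, is_deleted from mf_tt2_fix;

drop table if exists mf_tt2_fix;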

Things to note and future plans

Dynamically hard delete data

There is currently no way to hard-delete historical data in this pipeline (that part depends on flink-cdc). For now, it can be handled with soft deletion, or, after a period of historical data accumulation, by filtering all historical data and re-inserting it into the primary key table as a whole. It is worth mentioning that flink-cdc + flink-sql supports real-time hard deletion of data, but a per-table flink-cdc task is relatively heavy, and multiple tables require different server-ids, so it is not recommended when the source business database is already under heavy pressure. We look forward to subsequent CDAS whole-database synchronization.
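A hedged sketch of the soft-delete approach, reusing the illustrative is_deleted flag from the table sketch earlier: deleted business keys are written as new versions with the flag set, and downstream reads filter them out; deleted_user_keys is a hypothetical source holding the keys deleted upstream:

-- mark deleted keys instead of physically removing them
insert into table mf_tt2 partition (dd='01', hh='01')
select user_id as pk, user_name as val, 1 as is_deleted
from deleted_user_keys
where ds='${bizdate}' and user_id is not null;

-- downstream consumers only read live rows
select pk, val from mf_tt2 where dd='01' and hh='01' and is_deleted = 0;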

Increased storage space

The data storage space of the Transaction Table 2.0 primary key model is slightly larger than that of the windowed data in the partitioned table, mainly because the windowed data is distributed more evenly and compresses better. However, compared with the daily SQL computing cost, the daily cost of the extra storage space is at a much lower level (negligible).

flink-cdc

Combined with flink-cdc, near-real-time data synchronization can be achieved directly, improving data freshness.

Whole database synchronization

We look forward to Alibaba Cloud's Realtime Compute for Apache Flink CDAS syntax supporting MaxCompute as a target, enabling whole-database synchronization and DDL changes.

Materialized views

This can be achieved with a combination of materialized views and flink-cdc.


Source: blog.csdn.net/weixin_48534929/article/details/132578477