Best practices for real-time data lake processing of a large-enterprise ERP process with MRS

This article is shared from the HUAWEI CLOUD community "Best Practices in Real-time Data Lake Processing of MRS Large Enterprise ERP Process", author: Jin Hongqing.

This article uses an ERP process as an example to introduce the evolution of the MRS real-time data lake solution.

Analysis of the case requirements:

Business description

  • AE table: the accounting entry table, which mainly records finance-related information and can be used for business calculations such as cost accounting. It is the most important table for the business and is referred to as the driver table.
  • Four-channel tables: these come from four store business systems and mainly record sales information. They provide supporting evidence for cost accounting, account report analysis, and other businesses, and can be treated as dimension tables.

Business pain points

  • The account analysis report business suffers from slow data supply and high data latency.
  • Actual business data is updated after it is written, and the data must stay strictly consistent.
  • Account analysis report queries support only a small number of query conditions, such as company, account, and time period.

Advantages of the real-time data lake solution

  • The real-time data lake solution processes data incrementally, spreading the load of the traditional batch data supply across every day, hour, and minute. Querying 1 million records takes only 2 minutes.
  • Using Hudi as the data lake natively supports data updates.
  • All data is archived and can be traced back at any time.
  • 31 query conditions are supported, such as account, batch name, certificate name, and contract number. This greatly reduces the time users spend filtering data after exporting it and lets them analyze directly on the page.

Implementation challenges of the real-time data lake solution

  • Stream computing is memory-based, so an excessive peak data volume affects job stability.
  • When the delay between multiple streams is long, data waiting to be joined consumes a lot of memory. Business requirements must be balanced against resource usage.

Stream processing model one:

[Figure: stream processing model one]

Model features:

• Reading the Hudi tables as streams reduces overall memory overhead and improves job stability.

• One stream serves as the baseline (left table) and is joined against the other stream (right table).

• From the perspective of the driver table (AE table), joins can be missed for both new and updated records:

    • 1) The four-channel stream arrives early, and its data is lost once its TTL expires before the AE record arrives.

    • 2) The four-channel stream arrives late, after the TTL of the AE stream has expired and the AE data has been lost.

Limitations of model one:

• The data in the target wide table will be inaccurate.

• Records added at the source cannot be joined to a valid result, so rows are missing from the target wide table  ->  missing

• Changes at the source cannot be joined to a valid result, so the target wide table is updated late  ->  delay
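
The missed joins described for model one can be illustrated with a small simulation. This is only a sketch in plain Python, not Flink code: the `Event` type, the `ttl_join` function, the contract-number keys, and the 60-second TTL are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: int     # arrival time in seconds (hypothetical unit)
    side: str   # "AE" (driver stream) or "CH" (four-channel stream)
    key: str    # join key, e.g. a hypothetical contract number
    value: str


def ttl_join(events, ttl):
    """Join AE and CH events on key; buffered rows older than ttl are dropped."""
    state = {"AE": {}, "CH": {}}  # per-side buffers: key -> Event
    joined = []
    for ev in sorted(events, key=lambda e: e.ts):
        # Expire buffered rows whose TTL has passed before handling the event.
        for side in state:
            state[side] = {k: e for k, e in state[side].items()
                           if ev.ts - e.ts <= ttl}
        other = "CH" if ev.side == "AE" else "AE"
        match = state[other].get(ev.key)
        if match is not None:
            ae, ch = (ev, match) if ev.side == "AE" else (match, ev)
            joined.append((ae.key, ae.value, ch.value))
        state[ev.side][ev.key] = ev
    return joined


events = [
    # Both sides arrive within the TTL: joined.
    Event(0, "AE", "c1", "entry-1"), Event(5, "CH", "c1", "sale-1"),
    # Case 1: CH arrives early and expires before the AE row shows up.
    Event(0, "CH", "c2", "sale-2"), Event(100, "AE", "c2", "entry-2"),
    # Case 2: CH arrives late, after the AE row has expired.
    Event(10, "AE", "c3", "entry-3"), Event(200, "CH", "c3", "sale-3"),
]

short_ttl_result = ttl_join(events, ttl=60)
long_ttl_result = ttl_join(events, ttl=1000)
print(short_ttl_result)
print(len(long_ttl_result))
```

With the short TTL only `c1` reaches the wide table, while `c2` and `c3` are missing. Raising the TTL recovers them, but at the cost of keeping more state in memory, which is the trade-off named in the implementation challenges.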

Stream processing model two:

[Figure: stream processing model two]

Purpose of compensation:

Based on the business logic, compare the data in the source stream tables with the data in the target wide table and find the key fields of the records that are missing from the wide table. Then join those keys against the complete source table to retrieve the full missing records and write them back to the compensation layer of the source table.
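
The comparison step can be sketched as follows. This is a simplified illustration in plain Python, assuming that each row carries a primary key and an update timestamp used as a version marker; the function names and sample data are hypothetical, and a real job would run the comparison as a query over the Hudi tables.

```python
def find_compensation_keys(source_versions, wide_versions):
    """Compare source rows with wide-table rows.

    Both arguments map a primary key to a version marker (here an update
    timestamp). Returns (missing, delayed) key lists.
    """
    missing = sorted(k for k in source_versions if k not in wide_versions)
    delayed = sorted(k for k, v in source_versions.items()
                     if k in wide_versions and wide_versions[k] < v)
    return missing, delayed


def build_compensation_batch(keys, source_full):
    """Fetch the complete source rows for the given keys so they can be replayed."""
    return [source_full[k] for k in keys]


# Hypothetical sample data: c2 never reached the wide table, c3 is stale.
source_full = {
    "c1": {"key": "c1", "amount": 100, "updated_at": 10},
    "c2": {"key": "c2", "amount": 200, "updated_at": 20},
    "c3": {"key": "c3", "amount": 300, "updated_at": 30},
}
source_versions = {k: row["updated_at"] for k, row in source_full.items()}
wide_versions = {"c1": 10, "c3": 25}

missing, delayed = find_compensation_keys(source_versions, wide_versions)
batch = build_compensation_batch(missing + delayed, source_full)
print(missing, delayed)
print(batch)
```

The resulting batch is what would be written back to the compensation layer so that the stream job processes those records again.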

Missing and delay compensation simulation:

[Figure: missing and delay compensation simulation]

Features of model two: compared with model one, a compensation mechanism is added. It compares the source tables (AE table and four-channel tables) with the target wide table to find missing and delayed data.

Limitations of model two: in practice, the time gap between the two streams can be large, which makes them hard to align. The compensation mechanism can retrieve the missing data, but the more it does so, the less the stream processing job contributes and the more pressure falls on the compensation job. Data latency increases as a result.

Stream processing model three (final):

[Figure: stream processing model three]

Purpose of double writing: the business system continuously writes data to both Hudi tables and HBase tables. Reading the Hudi tables as streams supplies the hot data for the main join, while HBase stores the full history and technically acts as a dimension table. When the hot join fails, a lookup join against HBase is used to obtain a valid match. This improves the hit rate of the dual-stream join and reduces the overall data latency of stream processing.
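
The fallback from the hot join to the lookup join can be sketched as below. This is an illustration in plain Python under simplifying assumptions: two dictionaries stand in for the Hudi streaming state and the HBase table, and the class name, field names, and sample rows are hypothetical.

```python
class DualWriteJoiner:
    """Try the hot state first, then fall back to the full-history store."""

    def __init__(self, hot_state, history_store):
        self.hot_state = hot_state          # recent rows (stands in for Hudi stream state)
        self.history_store = history_store  # all historical rows (stands in for HBase)
        self.hot_hits = 0
        self.lookup_hits = 0
        self.misses = 0

    def join(self, ae_row):
        key = ae_row["key"]
        dim = self.hot_state.get(key)
        if dim is not None:
            self.hot_hits += 1
        else:
            # Hot join failed: fall back to a point lookup on the history store.
            dim = self.history_store.get(key)
            if dim is None:
                self.misses += 1
                return None  # left for the compensation job to pick up
            self.lookup_hits += 1
        return {**ae_row, **dim}


hot_state = {"c1": {"store": "s1"}}
history_store = {"c1": {"store": "s1"}, "c2": {"store": "s2"}}
joiner = DualWriteJoiner(hot_state, history_store)

wide_rows = [joiner.join({"key": k, "amount": a})
             for k, a in [("c1", 100), ("c2", 200), ("c9", 900)]]
print(wide_rows)
print(joiner.hot_hits, joiner.lookup_hits, joiner.misses)
```

In this sketch only `c9`, which exists in neither store, is left unmatched, so the compensation step has far less to repair than in model two.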

Dimension table selection:

Selection one

  • Description: dimension table data is stored in HBase and Redis and is joined through a lookup join when needed.
  • Features: suited to dimension tables with a large amount of data; balances performance and resources.
  • Key parameters:
      'lookup.cache' = 'ALL'
      'lookup.cache.ttl' = '600000'
      'lookup.cache.partitioned' = 'false'
      'lookup.parallelism' = '20'

Selection two

  • Description: dimension table data is stored in HBase, Redis, and Hudi; all of it is loaded into memory during processing and refreshed periodically.
  • Features: suited to dimension tables with a relatively small amount of data; achieves the best performance.
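
One plausible reading of 'lookup.cache' = 'ALL' together with 'lookup.cache.ttl' = '600000' is that the whole dimension table is held in memory and reloaded once the TTL elapses (10 minutes, if the unit is milliseconds). The sketch below illustrates only that behaviour in plain Python. It is not the connector's implementation, and `FullReloadCache`, `loader`, and the injectable `clock` are hypothetical names.

```python
import time


class FullReloadCache:
    """Keep a full snapshot of a dimension table and reload it after a TTL."""

    def __init__(self, loader, ttl_ms, clock=time.monotonic):
        self._loader = loader          # callable returning the whole table as a mapping
        self._ttl_s = ttl_ms / 1000.0  # TTL converted to seconds
        self._clock = clock            # injectable so the sketch can be tested
        self._data = {}
        self._loaded_at = None
        self.reloads = 0

    def get(self, key):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl_s:
            # Copy the table so later source changes are invisible until reload.
            self._data = dict(self._loader())
            self._loaded_at = now
            self.reloads += 1
        return self._data.get(key)


fake_now = [0.0]                      # manual clock, in seconds
dimension_table = {"c1": "store-1"}   # stands in for the external store
cache = FullReloadCache(lambda: dimension_table, ttl_ms=600000,
                        clock=lambda: fake_now[0])

first = cache.get("c1")               # triggers the initial full load
dimension_table["c2"] = "store-2"     # new row written after the snapshot
before_refresh = cache.get("c2")      # still served from the old snapshot
fake_now[0] = 600.0                   # 600000 ms later
after_refresh = cache.get("c2")       # TTL elapsed, snapshot reloaded
print(first, before_refresh, after_refresh, cache.reloads)
```

The sketch also shows the cost of a full cache: a row written between two reloads is invisible until the next refresh, so the TTL bounds how stale a lookup result can be.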

Model summary:

Solution 1: Dual-stream join

  • Features: the most basic stream processing solution; it does not handle the case where the join fails.
  • Applicable scenario: demonstration only; rarely used in real scenarios.

Solution 2: Dual-stream join + compensation

  • Features: the compensation logic backfills data on a schedule or manually when the join fails, but achieving low data latency places higher demands on Flink resources.
  • Applicable scenario: cases where Flink computing resources are sufficient for the actual business needs.

Solution 3: Double writing + dual-stream join + compensation

  • Features: a relatively complete solution that achieves lower data latency with lower resource consumption, but the processing logic is more complex.
  • Applicable scenario: the most commonly used option in real scenarios.


Origin blog.csdn.net/devcloud/article/details/132224977