Consistency comparison of MaxCompute data on the cloud

I have written many technical documents about how to reconcile data and how to reconcile it in batches. Recently a project ran into this problem again, and I found that no article on this topic has ever been published on the official blog. It is like the darkness right under the lamp: a knowledge point used for so long that I never realized its importance.

Note: The reconciliation scenario discussed here is one where big data development tools such as DataWorks on the Alibaba Cloud platform are used to integrate data from business system databases (Oracle, etc.) into MaxCompute in the cloud. The example SQL is therefore written for MaxCompute.

Let us first look at reconciliation in ordinary business terms. We built a report and produced a figure: "30 units of a certain product were sold". This figure exists not only on the big data platform; the business system keeps its own copy, produced through its own procedures and manual work. After the report is prepared, this summary figure is generally the first thing to be checked.

So the feedback from the front line is usually "the summary data is inconsistent". But this statement is very general. It is like feeling that my salary is 50 cents short this month: without looking at my salary slip, I cannot actually tell whether I was underpaid. The salary slip is not just a summary figure; it contains a series of detailed items such as pre-tax salary, bonus (variable), social security, and tax deductions. It is these details that let me judge whether I really lost 50 cents, and the processing behind them is complicated.

What I actually want to express is this: reconciliation ultimately means comparing detailed data. Detailed records are the basis of every computed fact and are what can serve as evidence.

So both sides check the detailed records behind the summary value, for example by querying "today's sales records for this product ID". The result: the business system has 31 records and the big data platform has 30.

Even at this point, we still do not know what happened in between or why a record was lost. Nor do we know whether data for other product IDs has also been lost, or whether similar situations occur in other tables.

1. Detailed data comparison

Since it is the detailed data that gets compared in the end, can we compare the detailed data directly? The answer is: yes.

Generally, when this happens, the first step is to compare the data in the two corresponding tables of the business system and the big data platform.


1- First, use a full-integration tool to extract all the data from the business system's database to the big data platform. To compare data, the data must be in one place; it cannot be compared across systems. Because the capacity of the big data platform is hundreds of times that of the business system, the comparison is generally done on the big data platform. (There is a paradox here: if the integration tool itself is flawed and loses data during extraction, the comparison can never succeed. So if you are not sure about the tool, you have to export the data from the database to a file and then load that file into a database for comparison. In my years of using this product for offline integration, the tool has proved very reliable and I have never run into this problem.)


2- Join the two tables on the primary key and compare the differences in the keys (a minimal sketch is given after this list). If the problem is the record loss described above, it is easy to spot after this step. There is another issue here: the business system's tables are changing all the time, so there will always be some differences compared with the tables on the big data platform. The core reason is that the big data platform's table is a snapshot of the business table at one point in time, the end of each day (00:00:00), while the business system's data keeps changing. So do not panic even if the difference exceeds expectations. If real-time synchronization is used, every change during the period can be retrieved from the archived logs and the cause of the change can be traced. If there is no real-time synchronization, the time-related fields in the table can be used to determine whether the data has been updated. If there is nothing at all (this situation also exists), go complain to the business system developers who designed the table (yes, it is their fault), or ask the business side to confirm in detail whether the record was created today rather than yesterday.


3- There is another situation: the primary keys match but the data content (fields other than the primary key) is inconsistent. In this case, data changes still have to be considered; they can be examined from several angles such as logs, time fields, and the business itself. If the data still does not meet expectations, the synchronization tool itself needs to be investigated.
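
As a reference for step 2, a minimal sketch of pulling out the primary keys that exist on one side only might look like the following. It reuses the tables that appear in the next section, and the partition values are assumptions; the full comparison SQL actually used is given in the next section.

-- sketch: keys present in today's reinitialized table (t1) but missing from yesterday's merged table (t2)
select  t1.BATCH_NUMBER
       ,t1.YP_ID
from ods_dev.o_rz_lms_im_timck01 t1        -- today's reinitialized data (assumed)
left join ods.o_rz_lms_im_timck01 t2       -- yesterday's merged full data (assumed)
       on  t1.BATCH_NUMBER  = t2.BATCH_NUMBER
       and t1.IN_STOCK_TIME = t2.IN_STOCK_TIME
       and t1.OP_NO         = t2.OP_NO
       and t1.STOCK_CODE    = t2.STOCK_CODE
       and t1.YP_ID         = t2.YP_ID
       and t2.ds = '20230426'              -- assumed partition value
where t1.ds = '20230426'
  and t2.BATCH_NUMBER is null              -- no match on the right side
;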

2. Comparison SQL analysis

In the section above, I described comparing the full table extracted today against yesterday's full table on MaxCompute (the one produced by merging the prior full table with yesterday's increment). The SQL for checking whether two tables hold the same set of rows is actually quite simple; everyone immediately thinks of set operations. Oracle has MINUS and EXCEPT, and so does MaxCompute. But to make the problem easier to analyze, I wrote the SQL myself. A sample (MaxCompute SQL) is as follows:

-- limit the date partitions; compare today's data with the previous day's
select  count(t1.BATCH_NUMBER) as cnt_left
       ,count(t2.BATCH_NUMBER) as cnt_right
       ,count(concat(t1.BATCH_NUMBER,t2.BATCH_NUMBER)) as pk_inner
       ,count(case when t1.BATCH_NUMBER is not null and t2.BATCH_NUMBER is null then 1 end) as pk_left
       ,count(case when t2.BATCH_NUMBER is not null and t1.BATCH_NUMBER is null then 1 end) as pk_right
       ,count(case when nvl(t1.rec_id         ,'') = nvl(t2.rec_id         ,'') then 1 end) as col_diff_rec_id
       ,count(case when nvl(t2.rec_creator    ,'') = nvl(t1.rec_creator    ,'') then 1 end) as col_diff_rec_creator
       ,count(case when nvl(t2.rec_create_time,'') = nvl(t1.rec_create_time,'') then 1 end) as col_diff_rec_create_time
from ods_dev.o_rz_lms_im_timck01 t1      -- today's data, reinitialized in the development environment
full join ods.o_rz_lms_im_timck01 t2     -- yesterday's full data from the incremental merge on the production link
       on  t1.BATCH_NUMBER  = t2.BATCH_NUMBER
       and t1.IN_STOCK_TIME = t2.IN_STOCK_TIME
       and t1.OP_NO         = t2.OP_NO
       and t1.STOCK_CODE    = t2.STOCK_CODE
       and t1.YP_ID         = t2.YP_ID
       and t2.ds = '20230426'
where t1.ds = '20230426'
;

-- cnt_left                 9205131  number of records in the left table
-- cnt_right                9203971  number of records in the right table
-- pk_inner                 9203971  number of records matched on the primary key
-- pk_left                  1160     records in the left table with no match in the right table
-- pk_right                 0        records in the right table with no match in the left table
-- col_diff_rec_id          9203971  records where the field values match; equal to pk_inner, so this field is consistent across the two tables
-- col_diff_rec_creator     9203971  same as above
-- col_diff_rec_create_time 9203971  same as above

In the example above, the left table holds today's reinitialized data and the right table holds the previous day's full data merged on MaxCompute. Before comparing, we should already understand that the data in these two tables cannot be completely consistent: although it is the same table, the points in time are different.

The inconsistencies fall into several types:

1- The primary key exists in table t1 but not in table t2;

2- The primary key exists in table t2 but not in table t1;

3- The primary key exists in both t1 and t2, but the values of the non-key fields differ;

4- The primary key exists in both t1 and t2, and the values of the non-key fields are identical.

Except for case 4, the other three states mean the two tables are inconsistent and need further verification. Under normal circumstances, case 1 is data newly inserted into the business database after midnight today, case 2 is data deleted from the business database after midnight today, case 3 is data updated after midnight today, and case 4 is data in the business table that has not been updated since midnight today.

Once these situations are understood, we can identify which differences are expected business changes and which ones indicate real data loss that needs to be traced further.
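
To go from counts to the actual offending rows, the set operation mentioned earlier can also be used. The following is only a sketch, assuming the two tables have the same column structure and using the same assumed partition values; it returns the rows (cases 1 and 3) that exist in today's reinitialized table but have no identical counterpart in yesterday's merged table:

-- rows present in today's data but not identical to any row in yesterday's merged data
select * from ods_dev.o_rz_lms_im_timck01 where ds = '20230426'
except
select * from ods.o_rz_lms_im_timck01 where ds = '20230426'
;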

3. DataWorks real-time synchronization log table

If only DataWorks offline synchronization is available, it is actually somewhat difficult to observe the data changes described above. If real-time synchronization is used, every change to the database data is preserved, and the changes mentioned in the previous chapter can be observed from the log. The following is the SQL I used to query these changes (against the DataWorks real-time synchronization log table). A sample is as follows:

select  from_unixtime(cast(substr(to_char(execute_time),1,10) as bigint)) as yy
       ,get_json_object(cast(data_columns as string),"$.rec_id") item0
       ,x.*
from ods.o_rz_lms_odps_first_log x                -- log table of the real-time synchronization data source o_rz_lms
where year='2023' and month='04' and day>='10'    -- limit the data interval
--and hour ='18'
and dest_table_name = 'o_rz_lms_im_timck01'       -- limit the destination table
-- primary key fields below
and get_json_object(cast(data_columns as string),"$.yp_id")         = 'L1'
and get_json_object(cast(data_columns as string),"$.batch_number")  = 'Y1'
and get_json_object(cast(data_columns as string),"$.in_stock_time") = '2'
and get_json_object(cast(data_columns as string),"$.op_no")         = '9'
and get_json_object(cast(data_columns as string),"$.stock_code")    = 'R'
--and operation_type='D'
order by execute_time desc
limit 1000
;


-- execute_time     time the data operation was executed
-- operation_type   operation type: I (insert), U (update), D (delete)
-- sequence_id      sequence number, never repeated
-- before_image     data before the change
-- after_image      data after the change
-- dest_table_name  name of the table the operation targets
-- data_columns     content of the operation, in JSON

The real-time synchronization data source of DataWorks periodically writes its log into a table named "data source name + _odps_first_log", which has four partition levels: year, month, day, and hour. The unique identifier of this table is not the data operation time but the serial number "sequence_id"; the order in which a row of data was updated can be followed through "execute_time".

A data update has a before state and an after state, so two fields, "before_image" (data before the change) and "after_image" (data after the change), identify the two states.

The row content is stored in JSON format in the field "data_columns". To locate a specific row, I use a function to parse the corresponding fields and pick out exactly the data I want.
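
If the goal is not to inspect one record at a time but to see the last change recorded for each primary key, a window function over the log table is one possible approach. This is only a sketch: the partition values are assumptions, and only two of the key columns are used in the partition clause for brevity:

-- sketch: latest log entry per primary key, ordered by execute_time (sequence_id as tie-breaker)
select *
from (
    select  x.*
           ,row_number() over (
                partition by get_json_object(cast(data_columns as string),"$.batch_number")
                            ,get_json_object(cast(data_columns as string),"$.yp_id")
                order by execute_time desc, sequence_id desc
            ) as rn
    from ods.o_rz_lms_odps_first_log x
    where year='2023' and month='04' and day='26'     -- assumed partition values
      and dest_table_name = 'o_rz_lms_im_timck01'
) t
where rn = 1
;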

4. Continuous quality assurance

So far, I have not talked about how to handle data inconsistencies. If the data is indeed found to be inconsistent, the available remedy is to reinitialize the full data. It should be emphasized that if the offline full-integration tool is trustworthy, fully initialized data will not be lost; if the tool cannot be trusted, a different method has to be used.

In many cases, business changes at the source end occasionally result in data exceptions. As long as the cause of data loss has not been identified, the data consistency has to be compared frequently so that problems are caught before they spread. Daily monitoring of data consistency is therefore also very important.
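
As one possible form of such daily monitoring, the comparison query from section 2 can be reduced to a single flag and scheduled as a daily task. This is only a sketch reusing the two tables from the example above; "${bizdate}" stands for the business-date scheduling parameter, which has to be configured in the scheduling tool:

-- sketch: daily consistency check producing an OK / CHECK flag
select  bdate
       ,case when pk_left = 0 and pk_right = 0 then 'OK' else 'CHECK' end as check_flag
from (
    select  '${bizdate}' as bdate
           ,count(case when t1.BATCH_NUMBER is not null and t2.BATCH_NUMBER is null then 1 end) as pk_left
           ,count(case when t2.BATCH_NUMBER is not null and t1.BATCH_NUMBER is null then 1 end) as pk_right
    from ods_dev.o_rz_lms_im_timck01 t1
    full join ods.o_rz_lms_im_timck01 t2
           on  t1.BATCH_NUMBER  = t2.BATCH_NUMBER
           and t1.IN_STOCK_TIME = t2.IN_STOCK_TIME
           and t1.OP_NO         = t2.OP_NO
           and t1.STOCK_CODE    = t2.STOCK_CODE
           and t1.YP_ID         = t2.YP_ID
           and t2.ds = '${bizdate}'
    where t1.ds = '${bizdate}'
) s
;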

