dataCompare: heterogeneous data comparison for big data

The earlier article "From 0 to 1: introducing the open source big data comparison platform dataCompare" described the functions of dataCompare in detail. At present, dataCompare supports the comparison of data from the same source.

1. Existing core functions of dataCompare

(1) Data volume (order of magnitude) comparison

(2) Consistency comparison

(3) Automatic discovery of differing cases

(4) Scheduled automatic data comparison

2. Background

However, the functions above currently apply only to homologous (same-source) data. First, here are the definitions of homologous and heterogeneous data as used in this article.

Homologous data means:

(1) The same storage engine on both sides (for example MySQL and MySQL, Hive and Hive, Doris and Doris).

(2) The same database or the same cluster. For example, when comparing Hive with Hive, both tables must be reachable from one cluster, so that the cluster itself can be used to compare and verify large batches of data.

Clearly, then, the platform's current functions suit comparison after a data migration within the same storage, but not a data architecture upgrade, for example when data originally stored in Hive has to be moved to a different storage such as Iceberg or Doris.

Since the code was open sourced, it has received a lot of attention, and many users have asked for heterogeneous comparison. So what is heterogeneous data?

(1) Different storage engines, such as Hive and Doris, or Hive and Iceberg.

(2) The same storage engine, but across databases or clusters, so that the two sides cannot be compared within a single cluster. (If both sides can be read from one cluster, the data still counts as homologous.)
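The two definitions can be summarized as a small check. This is only an illustrative sketch: the `DataSource` class, its field names, and the use of a cluster identifier as a stand-in for "both sides can be read from one cluster" are assumptions made here, not part of dataCompare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Hypothetical description of one side of a comparison."""
    storage: str  # storage engine, e.g. "hive", "doris", "mysql"
    cluster: str  # identifier of the cluster or database instance


def is_homologous(left: DataSource, right: DataSource) -> bool:
    """True when both sides use the same storage engine and the same cluster."""
    return (left.storage.lower() == right.storage.lower()
            and left.cluster == right.cluster)


def classify(left: DataSource, right: DataSource) -> str:
    """Label a pair of sources using the definitions from the text."""
    return "homologous" if is_homologous(left, right) else "heterogeneous"
```

Under this sketch, Hive against Doris is heterogeneous, and so is Hive against Hive when the two tables live in clusters that cannot see each other.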

3. Problems encountered

So is there a way to compare heterogeneous data, and what problems come up along the way?

At present, heterogeneous comparison faces the following problems:

(1) Cross-storage: different storage engines support different SQL dialects. Hive and ClickHouse, for example, do not accept the same SQL.

(2) Data volume: the data is too large to compare directly in memory. Homologous data avoids this because the comparison runs on the cluster.
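Problem (1) can be made concrete with a small sketch that produces the same logical expression, a lowercase hex MD5 of a string column, for several engines. The template table and function names are hypothetical, the key column is assumed to already be a string, and the built-in function names should be checked against the engine versions actually in use.

```python
# Hypothetical per-engine templates for "lowercase hex MD5 of a string expression".
MD5_HEX_TEMPLATES = {
    "hive": "md5({expr})",
    "doris": "md5({expr})",
    # ClickHouse MD5 returns raw bytes, so it is hex-encoded and lowercased here.
    "clickhouse": "lower(hex(MD5({expr})))",
}


def md5_hex_expr(engine: str, expr: str) -> str:
    """Return the engine-specific SQL expression for md5-hashing `expr`."""
    template = MD5_HEX_TEMPLATES.get(engine.lower())
    if template is None:
        raise ValueError(f"unsupported engine: {engine}")
    return template.format(expr=expr)


def key_hash_query(engine: str, table: str, key_column: str) -> str:
    """Build a query that selects the hashed key for every row of `table`."""
    hashed = md5_hex_expr(engine, key_column)
    return f"SELECT {hashed} AS key_hash FROM {table}"
```

Keeping the dialect differences in one lookup table means the comparison logic itself can stay engine-independent.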

4. Solutions

To compare heterogeneous data, two solutions are currently proposed:

(1) The simplest solution is to bring the data together, that is, synchronize one side so that both become homologous data, and then compare. This takes time for data synchronization and also wastes storage.

(2) Use data techniques to compare the heterogeneous data where it is.

This article focuses on the second solution.

Example: because of a data architecture upgrade, data previously stored in Hive (user_info_hive) is moved to Doris (user_info_doris). The requirement is that the table structure stays unchanged and only the data storage changes.

(1) Calculate pv and uv for user_info_hive and user_info_doris separately, and record them as pv_hive, pv_doris, uv_hive, and uv_doris.

(2) Use hashing: a. hash the primary key user_id of user_info_hive and of user_info_doris, and aggregate statistics on each side; b. concatenate all fields of each row in user_info_hive and in user_info_doris, compute an md5 hash, and aggregate statistics on each side.

(3) Consistency check: from user_info_hive and user_info_doris, select 10,000 records whose primary key user_id hash values are the same on both sides, then check how many of those records can be found on both sides.
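Steps (1) and (2) can be sketched in memory as follows. Several things here are assumptions rather than the platform's actual method: pv is taken to be the row count and uv the number of distinct user_id values, rows are plain Python dicts standing in for query results, and the "statistics" over the hashes are modeled as an order-independent fingerprint (the sum of the md5 values modulo 2^128). In practice each engine would compute these aggregates itself and only the small summary would be compared.

```python
import hashlib

FINGERPRINT_MODULUS = 1 << 128  # md5 digests are 128 bits wide


def md5_hex(text: str) -> str:
    """Lowercase hex md5 of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def row_hash(row: dict, columns: list) -> str:
    """md5 over all fields, joined with a separator; None gets an explicit marker."""
    parts = ["\\N" if row[c] is None else str(row[c]) for c in columns]
    return md5_hex("\x01".join(parts))


def table_stats(rows: list, key: str, columns: list) -> dict:
    """Summary of one table that can be compared with the other side's summary."""
    key_sum = sum(int(md5_hex(str(r[key])), 16) for r in rows)
    row_sum = sum(int(row_hash(r, columns), 16) for r in rows)
    return {
        "pv": len(rows),                    # assumed: total row count
        "uv": len({r[key] for r in rows}),  # assumed: distinct key count
        "key_fingerprint": key_sum % FINGERPRINT_MODULUS,
        "row_fingerprint": row_sum % FINGERPRINT_MODULUS,
    }
```

If `table_stats` returns equal dictionaries for user_info_hive and user_info_doris, the two sides agree on counts, keys, and field contents regardless of row order. Equal key fingerprints with different row fingerprints point to rows whose non-key fields changed.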

Comparing data with the techniques above is therefore a reasonable approach, and it achieves the goal of data comparison.
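The consistency check in step (3) can be sketched in the same way. The sampling rule used here, keeping rows whose user_id hash falls into a fixed bucket, is an assumption: it is one way to let both sides pick the same keys without exchanging data, and the bucket count would be tuned so that roughly 10,000 records are selected. The function names are hypothetical.

```python
import hashlib


def key_hash(value) -> int:
    """Integer md5 hash of a primary key value."""
    return int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16)


def sample_by_hash(rows: list, key: str, buckets: int, bucket: int = 0) -> dict:
    """Keep rows whose key hash lands in `bucket`; returns {key: row}."""
    return {r[key]: r for r in rows if key_hash(r[key]) % buckets == bucket}


def compare_samples(left: dict, right: dict) -> dict:
    """Report which sampled keys exist on both sides and which rows differ."""
    left_keys, right_keys = set(left), set(right)
    both = left_keys & right_keys
    return {
        "matched_keys": len(both),
        "only_left": sorted(left_keys - right_keys),
        "only_right": sorted(right_keys - left_keys),
        "mismatched": sorted(k for k in both if left[k] != right[k]),
    }
```

Because the sample is chosen from the hash of the key alone, the Hive side and the Doris side select the same user_id values independently, so any key that appears on only one side is a real difference rather than a sampling artifact.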

Origin blog.csdn.net/weixin_43291055/article/details/128554020