From 0 to 1: introducing dataCompare, an open source big data comparison platform

1. Background & current situation

I have worked in big data for many years. At both large and small companies, cluster upgrades and migrations involve moving data, and the same questions come up every time: Is the migrated data correct? Is the data on both sides consistent? If not, where do the two sides differ? Can the differences be found and the problem resolved quickly?

In the past, developers usually wrote ad hoc SQL scripts for each comparison, and there was no shared standard for evaluating the result, so the work was inefficient.

A platform of this kind is mentioned in the book "Alibaba Big Data Road", but because it is not available outside the company, the book describes it only briefly. Drawing on my previous work experience, I therefore developed a big data comparison platform to assist with data verification and named it dataCompare.

It mainly solves the following problems:

(1) Data verification and comparison consume a lot of labor. Comparing one table and locating its differing records can take 1-2 hours, so the total time cost is roughly 2H*N (N is the number of tables).

(2) There is no common standard, so verification results are hard to evaluate and the criteria differ from one comparison to the next. Some people only check the data volume, yet the data itself can still be wrong.

(3) Verification often means writing a long, complex SQL statement and judging from its results whether there is a problem. The SQL itself usually has to be debugged before it runs correctly.

2. Goal

To solve the problems above, I developed dataCompare, a big data comparison platform, with the following goals:

(1) Automate data verification and comparison through UI interaction, check-box selection, or low-code configuration, avoiding complex SQL debugging.

(2) Establish a unified set of data verification standards, such as magnitude (data volume) comparison and consistency comparison, so that different developers no longer choose inconsistent standards.

(3) Improve the data team's verification and comparison efficiency by at least 50%.
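
The two standards named in goal (2) can be sketched as follows. This is a minimal illustration and not dataCompare's actual implementation: it uses Python with an in-memory SQLite database, and the table names `src` and `dst`, the columns, and both function names are hypothetical.

```python
import sqlite3

def magnitude_compare(conn, left, right):
    """Magnitude comparison: compare the row counts of two tables."""
    left_rows = conn.execute(f"SELECT COUNT(*) FROM {left}").fetchone()[0]
    right_rows = conn.execute(f"SELECT COUNT(*) FROM {right}").fetchone()[0]
    return left_rows, right_rows, left_rows == right_rows

def consistency_compare(conn, left, right, key, columns):
    """Consistency comparison: count keys whose compared columns all match."""
    # SQLite's IS operator treats two NULLs as equal, unlike "=".
    same = " AND ".join(f"l.{c} IS r.{c}" for c in columns)
    sql = (f"SELECT COUNT(*) FROM {left} l JOIN {right} r "
           f"ON l.{key} = r.{key} WHERE {same}")
    return conn.execute(sql).fetchone()[0]

# Hypothetical sample data: same row count, but one row differs in "amount".
conn = sqlite3.connect(":memory:")
for table in ("src", "dst"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, name TEXT, amount INTEGER)"
    )
conn.executemany("INSERT INTO src VALUES (?, ?, ?)",
                 [(1, "a", 10), (2, "b", 20), (3, "c", 30)])
conn.executemany("INSERT INTO dst VALUES (?, ?, ?)",
                 [(1, "a", 10), (2, "b", 21), (3, "c", 30)])

print(magnitude_compare(conn, "src", "dst"))                            # (3, 3, True)
print(consistency_compare(conn, "src", "dst", "id", ["name", "amount"]))  # 2
```

In this sample both tables hold three rows, so the magnitude check passes, yet only two rows agree on every compared column. Table and column names are interpolated directly into the SQL here, so in this sketch they must come from trusted configuration.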

3. Introduction to the core functions of the system

dataCompare currently provides the following functions:

(1) Comparison tasks are configured interactively in the UI; low code and a small amount of configuration are enough to generate a comparison task quickly.

(2) Magnitude comparison, consistency comparison, and automatic discovery of difference cases.

(3) JDBC data sources such as MySQL, Hive, and Doris are currently supported.

The comparison process is as follows:

(1) Add the database information

(2) Select the data to be compared

(3) Run the comparison task

(4) Difference discovery

Difference cases are filtered out automatically to make troubleshooting easier.
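
As a rough illustration of what difference discovery involves, the sketch below classifies keys into three groups: present only on the left, present only on the right, and present on both sides with differing values. It is an assumption about the general technique, not the project's code; it runs on an in-memory SQLite database, and the function name `find_differences` and the sample tables are hypothetical.

```python
import sqlite3

def find_differences(conn, left, right, key, columns):
    """Return the keys missing on either side and the keys whose values differ."""
    def keys(sql):
        return [row[0] for row in conn.execute(sql)]

    only_left = keys(
        f"SELECT l.{key} FROM {left} l LEFT JOIN {right} r "
        f"ON l.{key} = r.{key} WHERE r.{key} IS NULL ORDER BY l.{key}"
    )
    only_right = keys(
        f"SELECT r.{key} FROM {right} r LEFT JOIN {left} l "
        f"ON l.{key} = r.{key} WHERE l.{key} IS NULL ORDER BY r.{key}"
    )
    # IS NOT is SQLite's NULL-safe inequality operator.
    differs = " OR ".join(f"l.{c} IS NOT r.{c}" for c in columns)
    mismatched = keys(
        f"SELECT l.{key} FROM {left} l JOIN {right} r "
        f"ON l.{key} = r.{key} WHERE {differs} ORDER BY l.{key}"
    )
    return {"only_left": only_left, "only_right": only_right, "mismatched": mismatched}

# Hypothetical sample data: id 3 was lost, id 4 is extra, id 2 changed.
conn = sqlite3.connect(":memory:")
for table in ("src", "dst"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
conn.executemany("INSERT INTO dst VALUES (?, ?)", [(1, 10), (2, 21), (4, 40)])

print(find_differences(conn, "src", "dst", "id", ["amount"]))
# {'only_left': [3], 'only_right': [4], 'mismatched': [2]}
```

Returning the offending keys rather than a pass/fail flag is what makes troubleshooting faster: each key points directly at a record to inspect.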

4. System architecture design

The front end lets users select the tables and fields to verify and compare, and generates verification tasks.

The back end uses Spring Boot and MyBatis to write the configuration submitted by the front end into MySQL tables, and then starts a MapReduce or Spark job to perform the verification. The supported compute engines are currently MapReduce and Spark, and the supported data storage includes HDFS and Hive. More compute engines and storage engines may be added in the future.
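
The step from stored configuration to an executable check can be pictured roughly as below. The actual back end is a Spring Boot application, so this Python sketch only illustrates the idea; the `CompareTask` fields, the `build_sql` function, and the table names are all hypothetical and do not reflect the project's real schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class CompareTask:
    """Hypothetical shape of one comparison task saved by the front end."""
    source_table: str
    target_table: str
    key_column: str
    compare_columns: Tuple[str, ...]

def build_sql(task: CompareTask) -> Dict[str, str]:
    """Turn a task configuration into the SQL statements a job would run."""
    # Plain "=" is used here, so rows with NULL in a compared column never match.
    same = " AND ".join(f"s.{c} = t.{c}" for c in task.compare_columns)
    return {
        "source_rows": f"SELECT COUNT(*) FROM {task.source_table}",
        "target_rows": f"SELECT COUNT(*) FROM {task.target_table}",
        "matched_rows": (
            f"SELECT COUNT(*) FROM {task.source_table} s "
            f"JOIN {task.target_table} t ON s.{task.key_column} = t.{task.key_column} "
            f"WHERE {same}"
        ),
    }

task = CompareTask("ods.orders", "dw.orders", "order_id", ("status", "amount"))
for name, sql in build_sql(task).items():
    print(name, "->", sql)
```

Keeping the task as data and generating the SQL from it is what allows the user to configure a comparison without writing or debugging SQL by hand.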

5. System function demonstration

(1) Home page

(2) Database configuration page

(3) Comparison configuration page

(4) Comparison result display

(5) Automatic discovery of difference cases

6. Future plans

(1) Abnormal table data detection, including enumeration value detection, range value detection, and primary key hash detection.

(2) Automatic scheduling of comparison tasks, with the comparison report sent automatically through multiple channels such as email.

(3) Heterogeneous data comparison. The project currently compares data from the same kind of source; support for comparing data across heterogeneous sources is planned.
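
Since the checks in item (1) are still planned, the following is only a guess at what enumeration value detection and range value detection could look like. Both function names and the sample rows are hypothetical, and the rows are plain Python dictionaries rather than data read from a real table.

```python
def check_enum(rows, column, allowed):
    """Return the rows whose value in `column` is not in the allowed set."""
    return [row for row in rows if row[column] not in allowed]

def check_range(rows, column, low, high):
    """Return the rows whose value in `column` is missing or outside [low, high]."""
    return [
        row for row in rows
        if row[column] is None or not (low <= row[column] <= high)
    ]

# Hypothetical sample rows.
rows = [
    {"id": 1, "status": "paid", "amount": 120},
    {"id": 2, "status": "unknown", "amount": 80},
    {"id": 3, "status": "refunded", "amount": -5},
]

bad_status = check_enum(rows, "status", {"paid", "refunded"})
bad_amount = check_range(rows, "amount", 0, 10_000)
print([row["id"] for row in bad_status])  # [2]
print([row["id"] for row in bad_amount])  # [3]
```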

7. The core code is open source

https://github.com/zhugezifang/dataCompare

https://gitee.com/ZhuGeZiFang/data-compare

Origin blog.csdn.net/weixin_43291055/article/details/128393823