ES dual-center data audit (synchronization)

Data audit scenario

In an ES dual-center deployment, the data differences between the two ES indexes must be verified periodically and in near real time, and measures must be taken to keep the data consistent.

Data audit scheme

Since an ES index generally holds a large amount of data, it cannot simply be loaded into memory for a detailed comparison. The scheme adopted is as follows: using the business time field, query the total amount of data in the same time period of the two cluster indexes, and calculate and compare a cumulative CRC32 value over the key fields of all documents in that period. Time periods whose results differ are subdivided further and the count and CRC32 comparisons are repeated; finally, once the data volume is small enough, a detailed in-memory comparison finds the inconsistent data.

Data audit steps

  1. Obtain the time-period division type for the index data in the specified time range. Query and count the index data in the specified time range, find the maximum and minimum time, and calculate the time difference. The division types are day, hour, minute, and second: more than 1 day is divided by day, more than 1 hour but not more than 1 day by hour, more than 1 minute but not more than 1 hour by minute, and more than 1 second but not more than 1 minute by second (a selection sketch follows this list).
  2. Use date_histogram to subdivide. Query the index data of the specified time period and aggregate it according to the division type: a date_histogram aggregation returns the total amount of data in each time period for the two indexes (see the date_histogram sketch after this list).
  3. Compare the total amount of data in the same time period of the two indexes. If the totals differ, the period is already known to be inconsistent and no CRC32 comparison is performed for it.
  4. Calculate and compare the CRC32 value for time periods whose totals match. Use a scroll query to fetch the detailed data of each such period, sorted by the specified fields; run the specified field values through CRC32 and compare the final cumulative values. If the CRC32 values are equal, the data in that period is considered consistent (see the scroll/CRC32 sketch after this list).
  5. Collect the time periods whose data volumes or CRC32 values differ, subdivide them in the order of day, hour, minute, and second, and repeat steps 2, 3, and 4 on the smaller periods.
  6. Perform a detailed in-memory comparison once the data volume is small, and store the result for subsequent processing. When a single time period contains fewer than 10,000 documents, the documents can be loaded directly into memory for a two-way comparison, and the detailed difference result is stored (see the in-memory comparison sketch after this list).
  7. Perform steps 1-6 periodically. Set up timed tasks, for example auditing and comparing the previous day's data every morning, or comparing the data from 10 minutes earlier every 10 minutes (making sure the data in the compared time range no longer changes). If synchronization is required, the difference data can be corrected or re-written in both centers (see the scheduling sketch after this list).
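
As a rough illustration of step 1, the sketch below picks a division type from the minimum and maximum values of the business time field. The thresholds come from the description above; the class, enum, and method names are hypothetical helpers, not part of the original program.

```java
import java.time.Duration;

// Hypothetical helper: choose the date_histogram division type from the audited time span (step 1).
public class IntervalSelector {

    public enum PeriodType { DAY, HOUR, MINUTE, SECOND }

    // minTimeMillis/maxTimeMillis are the min/max of the business time field (startTime), in epoch millis.
    public static PeriodType choose(long minTimeMillis, long maxTimeMillis) {
        Duration span = Duration.ofMillis(maxTimeMillis - minTimeMillis);
        if (span.compareTo(Duration.ofDays(1)) > 0) {
            return PeriodType.DAY;      // more than 1 day: divide by day
        }
        if (span.compareTo(Duration.ofHours(1)) > 0) {
            return PeriodType.HOUR;     // more than 1 hour but not more than 1 day: by hour
        }
        if (span.compareTo(Duration.ofMinutes(1)) > 0) {
            return PeriodType.MINUTE;   // more than 1 minute but not more than 1 hour: by minute
        }
        return PeriodType.SECOND;       // otherwise: by second
    }
}
```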
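
For step 2, a minimal sketch of counting documents per bucket with a date_histogram aggregation. The article does not say which client the test program uses, so this assumes an Elasticsearch 6.x RestHighLevelClient; the index and field names follow the test environment below, everything else (method and aggregation names) is illustrative.

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class BucketCounter {

    // Returns bucket key -> document count for one index and one time range (step 2).
    public static Map<String, Long> countPerBucket(RestHighLevelClient client, String index,
                                                   String from, String to,
                                                   DateHistogramInterval interval) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .size(0) // only the aggregation result is needed
                .query(QueryBuilders.rangeQuery("startTime").gte(from).lt(to))
                .aggregation(AggregationBuilders.dateHistogram("per_period")
                        .field("startTime")
                        .dateHistogramInterval(interval));

        SearchResponse response = client.search(new SearchRequest(index).source(source),
                RequestOptions.DEFAULT);

        Map<String, Long> counts = new LinkedHashMap<>();
        Histogram histogram = response.getAggregations().get("per_period");
        for (Histogram.Bucket bucket : histogram.getBuckets()) {
            counts.put(bucket.getKeyAsString(), bucket.getDocCount());
        }
        return counts;
    }
}
```

Running this against both indexes and comparing the two maps bucket by bucket gives the count comparison of step 3.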
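
For step 4, a sketch of the cumulative CRC32 over the key fields (traceId, id) of one time period, fetched with a scroll query under the same 6.x client assumption as above. The identical sort on both indexes is what makes the accumulated value comparable; the class and method names are illustrative.

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.CRC32;

public class PeriodChecksum {

    // Cumulative CRC32 of the key fields of all documents in [from, to) of one index (step 4).
    public static long crc32OfPeriod(RestHighLevelClient client, String index,
                                     String from, String to) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .size(1000)
                .query(QueryBuilders.rangeQuery("startTime").gte(from).lt(to))
                .sort("traceId", SortOrder.ASC)  // identical sort on both indexes so the
                .sort("id", SortOrder.ASC)       // accumulation order is deterministic
                .fetchSource(new String[]{"traceId", "id"}, null);

        SearchRequest request = new SearchRequest(index)
                .source(source)
                .scroll(TimeValue.timeValueMinutes(1L));
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);

        CRC32 crc = new CRC32();
        while (response.getHits().getHits().length > 0) {
            for (SearchHit hit : response.getHits().getHits()) {
                Map<String, Object> doc = hit.getSourceAsMap();
                String key = doc.get("traceId") + "|" + doc.get("id");
                crc.update(key.getBytes(StandardCharsets.UTF_8)); // accumulate over every document
            }
            SearchScrollRequest scrollRequest = new SearchScrollRequest(response.getScrollId())
                    .scroll(TimeValue.timeValueMinutes(1L));
            response = client.searchScroll(scrollRequest, RequestOptions.DEFAULT);
        }
        return crc.getValue(); // equal values on both indexes => the period is considered consistent
    }
}
```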
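
For step 6, a sketch of the two-way in-memory comparison once a period holds fewer than 10,000 documents per side. The key is the traceId + id combination; the Doc type and the checksum field are stand-ins for whatever the detailed query returns.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DetailDiff {

    // Stand-in for one detailed document: key = traceId + "|" + id, checksum over the compared fields.
    public static class Doc {
        final String key;
        final long checksum;
        public Doc(String key, long checksum) { this.key = key; this.checksum = checksum; }
    }

    public static class DiffResult {
        public final Map<String, Doc> onlyInA = new HashMap<>();
        public final Map<String, Doc> onlyInB = new HashMap<>();
        public final Map<String, Doc> changed = new HashMap<>(); // same key, different content
    }

    // Two-way comparison of the documents of one small time period (step 6).
    public static DiffResult compare(List<Doc> indexA, List<Doc> indexB) {
        Map<String, Doc> remaining = new HashMap<>();
        for (Doc doc : indexB) {
            remaining.put(doc.key, doc);
        }
        DiffResult result = new DiffResult();
        for (Doc doc : indexA) {
            Doc other = remaining.remove(doc.key);
            if (other == null) {
                result.onlyInA.put(doc.key, doc);        // missing in index B
            } else if (other.checksum != doc.checksum) {
                result.changed.put(doc.key, doc);        // present in both but different
            }
        }
        result.onlyInB.putAll(remaining);                // whatever is left exists only in B
        return result;
    }
}
```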
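
For step 7, a sketch of the timed trigger, assuming the Spring scheduling support available in the Spring Boot version listed below (with @EnableScheduling on a configuration class). The cron value and method names are illustrative; only the two trigger patterns come from the step above.

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class AuditScheduler {

    // Audit and compare the previous day's data every morning (illustrative time: 02:00).
    @Scheduled(cron = "0 0 2 * * *")
    public void auditPreviousDay() {
        // run steps 1-6 for yesterday's time range
    }

    // Every 10 minutes, compare the data from 10 minutes earlier,
    // so the compared time range no longer changes.
    @Scheduled(fixedRate = 600000L)
    public void auditRecentWindow() {
        // run steps 1-6 for the window [now - 20 min, now - 10 min)
    }
}
```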

Test environment

Test program hardware environment: Windows 7, 2.2 GHz CPU, 8 GB memory
Test program software environment: JDK 1.8, Spring Boot 1.4.1
Test environment container cluster: es-test
Index: index_2019_08_19
Comparison index: index_2019_08_17
Time field: startTime
Unique primary key: combination of traceId and id

Test method

  1. Create a new index index_2019_08_17 with the same structure as index_2019_08_19
  2. Write the index_2019_08_19 data into index_2019_08_17 through reindex (see the sketch after this list)
  3. Modify or delete some data in index_2019_08_17 and index_2019_08_19
  4. Follow the ES data audit steps above for the comparison
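
A sketch of step 2 of the test method: copying index_2019_08_19 into index_2019_08_17 with the _reindex API via the Elasticsearch low-level REST client. Only the two index names come from the test environment above; the client setup and class name are assumptions.

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.io.IOException;

public class ReindexCopy {

    // Copy all documents from index_2019_08_19 into index_2019_08_17 (test method, step 2).
    public static Response copy(RestClient lowLevelClient) throws IOException {
        Request reindex = new Request("POST", "/_reindex");
        reindex.setJsonEntity(
                "{"
              + "  \"source\": { \"index\": \"index_2019_08_19\" },"
              + "  \"dest\":   { \"index\": \"index_2019_08_17\" }"
              + "}");
        return lowLevelClient.performRequest(reindex);
    }
}
```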

Test results

  1. Audit speed

The total data volume of a single index is about 2.3 million documents, and the two indexes differ by one document. On average the audit completes within 60 s.

  2. Resource usage

During the audit, 2 thread pools of 10 threads each are used, plus 2 intermediate comparison threads, for at most 22 threads. CPU usage stays between 20% and 40%, and memory usage between 200 MB and 500 MB (a sketch of the pool layout follows).
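
As a rough illustration of the thread layout just described: two fixed pools of 10 worker threads plus a small pool for the intermediate comparison. Only the pool sizes come from the measurement above; the names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AuditThreadPools {

    // One pool per index for the queries and CRC32 work, 10 threads each (20 worker threads).
    public static final ExecutorService INDEX_A_POOL = Executors.newFixedThreadPool(10);
    public static final ExecutorService INDEX_B_POOL = Executors.newFixedThreadPool(10);

    // 2 threads that compare the intermediate results of the two pools (22 threads in total).
    public static final ExecutorService COMPARE_POOL = Executors.newFixedThreadPool(2);
}
```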

Summary

The ES data audit function can be implemented, but some points need attention:

  • The business data must have a unique primary key or a combined primary key
  • When defining the audit execution cycle, account for data-ingestion delay, which can make the audit inaccurate, and make sure the data within the audited time range no longer changes

Origin: blog.csdn.net/yml_try/article/details/108659522