What should I do if the data is inconsistent after the database is synchronized with Elasticsearch?

1. Real-world problems

  • Q1: Data is inconsistent after Logstash synchronizes PostgreSQL to Elasticsearch.

When using Logstash to import a table from the PostgreSQL database into ES, there is a large gap between the document count in ES and the row count of the table in PG. How can we quickly find which rows were not inserted? The Logstash logs look normal during the import. The table in PG has about 76 million rows.

  • Q2: In a scheme that uses MQ to asynchronously double-write to the database and ES, how can consistency between the database data and the ES data be ensured?

2. Recommended solution 1 - the ID comparison method

The example below addresses only question 1; the principle for question 2 is the same.

2.1 Scheme Discussion

To find out which data is not inserted into Elasticsearch, you can use the following method:

  • Make sure the jdbc input plugin in the Logstash configuration file is properly configured to pull all data from the PostgreSQL database. Pay particular attention to the statement parameter and make sure its query selects all the required rows (a reference input sketch follows the output example below).

  • Check the output plugin in the Logstash configuration file to ensure the connection parameters for Elasticsearch are correct. Also check whether any filter plugins are dropping data during the import.

  • Add a file output plugin to the Logstash configuration file to write the data read from the PostgreSQL database to a log file.

For example, the following could be added:

output {
  elasticsearch {
    ... Elasticsearch configuration ...
  }
  file {
    path => "/path/to/logstash_output.log"
    codec => json_lines
  }
}
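
For the first point, a minimal jdbc input sketch is shown below for reference; the connection string, credentials, driver path, and table name are all placeholders to adapt to your environment:

input {
  jdbc {
    jdbc_driver_library => "/path/to/postgresql-jdbc.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/your_db"
    jdbc_user => "your_user"
    jdbc_password => "your_password"
    statement => "SELECT * FROM your_table"
    jdbc_paging_enabled => true
    jdbc_page_size => 10000
  }
}

Paging keeps memory usage bounded when pulling a table of this size.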

Compare the Logstash output file with the original data in the PostgreSQL database to find the data that was not imported. A simple script in Python, shell, or another language can do this; section 2.2 shows a shell implementation.

If the number of records in the Logstash output file matches the number of records in the PostgreSQL database but not the count in Elasticsearch, check the health and logs of the Elasticsearch cluster to see whether it is having problems receiving and indexing data.
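
For example, assuming Elasticsearch is listening on localhost:9200 and the target index is your_index (a placeholder), the cluster health and the indexed document count can be checked with:

curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/your_index/_count?pretty'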

If the problem persists, try reducing the batch size to lower the load on Elasticsearch and Logstash. In older Logstash versions this is done with the flush_size and idle_flush_time parameters of the elasticsearch output plugin; in newer versions those options were removed and batching is controlled by the pipeline settings in logstash.yml.
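
A sketch of the relevant settings in logstash.yml for newer versions (the values are illustrative, not recommendations):

pipeline.batch.size: 250    # events per batch per worker thread (default 125)
pipeline.batch.delay: 50    # ms to wait before flushing an undersized batch (default 50)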

When dealing with large amounts of data, the performance and resource configuration of Logstash and Elasticsearch may need tuning: batch settings, JVM settings, thread pool sizes, and so on, depending on hardware and network conditions.
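
For example, the Logstash JVM heap is set in config/jvm.options; the value below is an assumption to size against your host's memory, keeping -Xms and -Xmx equal:

-Xms4g
-Xmx4g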

2.2 Implementation of the comparison script

The following is an example of a simple shell script that compares a Logstash output file (in JSON Lines format) with data in the PostgreSQL database. The script compares a specific field, such as id, to determine which data may not have been imported into Elasticsearch.


First, export the data from the PostgreSQL database, saving it as a CSV file:

COPY (SELECT id FROM your_table) TO '/path/to/postgres_data.csv' WITH CSV HEADER;
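
Note that COPY ... TO writes the file on the database server and normally requires superuser (or pg_write_server_files) privileges. When exporting from a client machine, psql's \copy variant writes the file locally instead:

psql -d your_db -c "\copy (SELECT id FROM your_table) TO '/path/to/postgres_data.csv' WITH CSV HEADER"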

Next, create a shell script called compare.sh:

#!/bin/bash
# Extract the IDs from the JSON Lines file
jq '.id' /path/to/logstash_output.log > logstash_ids.txt

# Strip the double quotes that jq keeps around string IDs
sed -i 's/"//g' logstash_ids.txt

# Sort both ID files; comm expects plain lexicographic sort order,
# so use sort without -n. tail -n +2 skips the CSV header row.
sort logstash_ids.txt > logstash_ids_sorted.txt
tail -n +2 /path/to/postgres_data.csv | sort > postgres_ids_sorted.txt

# Use comm to list the IDs that appear only in the PostgreSQL file
comm -23 postgres_ids_sorted.txt logstash_ids_sorted.txt > missing_ids.txt

# Print the result
echo "The following IDs were not found in the Logstash output file:"
cat missing_ids.txt

Make the script executable and run it:

chmod +x compare.sh

./compare.sh

This script compares the IDs in logstash_output.log and postgres_data.csv. Any missing IDs are saved to missing_ids.txt and printed to the console. Note that the script assumes jq (a command-line JSON processor) is installed; if it is not, install jq first.
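
For example, jq can be installed with the system package manager:

# Debian/Ubuntu
sudo apt-get install jq
# CentOS/RHEL
sudo yum install jq
# macOS (Homebrew)
brew install jq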

3. Recommended solution 2 - Redis accelerated comparison

In this case, Redis' set data type can be used to store the IDs from the PostgreSQL database and from the Logstash output file. The missing IDs can then be found using the set operations that Redis provides.


The following is an example of using Redis to accelerate the comparison:

First, export the data from the PostgreSQL database, saving it as a CSV file:

COPY (SELECT id FROM your_table) TO '/path/to/postgres_data.csv' WITH CSV HEADER;

Install and start Redis.

Load ID data into Redis using a Python script:

import csv
import json

import redis

# Connect to Redis (decode_responses returns str instead of bytes)
r = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

# Load the IDs from the CSV file exported from PostgreSQL
with open('/path/to/postgres_data.csv', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        r.sadd('postgres_ids', row[0])

# Load the IDs from the Logstash output file (one JSON document per line)
with open('/path/to/logstash_output.log') as logstash_file:
    for line in logstash_file:
        doc = json.loads(line)
        r.sadd('logstash_ids', str(doc['id']))

# Compute the set difference: IDs in PostgreSQL but not in the Logstash output
missing_ids = r.sdiff('postgres_ids', 'logstash_ids')

# Print the missing IDs
print("The following IDs were not found in the Logstash output file:")
for missing_id in missing_ids:
    print(missing_id)

This Python script stores the IDs in Redis sets and then computes the difference between them to find the missing IDs. The redis library for Python needs to be installed first; it can be installed with the following command:

pip install redis

This script is a basic example and can be modified and extended as needed. The advantage of using Redis is that it processes large amounts of data quickly in memory, without reading and writing temporary files on disk.
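
The same sets can also be inspected interactively with redis-cli, using the keys created by the script above:

redis-cli SCARD postgres_ids                 # how many IDs were loaded from PostgreSQL
redis-cli SCARD logstash_ids                 # how many IDs were loaded from the Logstash output
redis-cli SDIFF postgres_ids logstash_ids    # IDs present in PostgreSQL but missing from the output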

4. Summary

Option 1: Using a shell script with jq, sort, and comm

  • Advantages:

(1) Simple and easy to implement.

(2) Requires only standard command-line tools (plus jq).

  • Disadvantages:

(1) Slower, because it reads and writes temporary files on disk.

(2) With large data volumes, it may cause high disk I/O and memory consumption.

Option 2: Using Redis to accelerate the comparison

  • Advantages:

(1) Faster, because Redis is a memory-based data structure store.

(2) Scales well and can handle large amounts of data.

  • Disadvantages:

(1) The implementation is more involved and requires writing an additional script.

(2) A Redis server must be installed and running.

Depending on your needs and data volume, you can choose the appropriate solution. If the amount of data is small and speed is not critical, choose option 1, the shell script with standard command-line tools. This approach is simple and easy to use, but may not perform well with large data volumes.

If you need to process a large amount of data, option 2 is recommended: use Redis to accelerate the comparison. This method is faster and handles large data volumes efficiently, but it requires additional setup and configuration, such as installing a Redis server and writing a Python script.

In practical applications, it may be necessary to make trade-offs according to specific needs to choose the most suitable solution.
