Postgres Migration Pitfalls: Collations

Migrating data with PostgreSQL is convenient, but it can lead to corrupted data if you don't review collations and their versions along the way. Here are a few points to consider:

Check whether the glibc versions match

Running against a mismatched glibc version carries the following risks:

  • Missing rows when querying
  • Inconsistent ordering between versions
  • Unique constraint violations going undetected

These can all cause data corruption. For example, suppose there is a unique constraint on email addresses and two glibc versions sort the values differently: a lookup that relies on the old sort order may return empty results, and two accounts with the same email may both be accepted because the index never detects the duplicate. If the corruption is limited to a single record it may be simple to fix, but the larger the data set, the harder the cleanup becomes.

Differences between glibc versions can cause problems when:

  • Migrating a database from one host to another using physical replication or physical backup tools (e.g., wal-e, wal-g, pgBackRest).
  • Restoring a binary backup (taken with pg_basebackup) on a system with a different OS configuration.
  • Upgrading the Linux operating system to a new release while keeping the PostgreSQL data directory. In this case the glibc version may have changed, but the on-disk data has not.

Not all types of migration or replication are affected by this inconsistency. Situations where data is transferred logically (not as binary files) are safe, including:

  • Backup and restore with pg_dump, since the dump contains only logical data
  • Logical replication, which copies rows rather than physical data files
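As a sketch of the logical path, assuming a source database named test and an already-created empty target test_new (both names are illustrative):

```shell
# Logical backup: the dump contains SQL-level data only, no index files.
pg_dump -Fc -d test -f test.dump
# Restore on the new host: all indexes are rebuilt from scratch here,
# so they are built consistently under the new host's collation.
pg_restore -d test_new test.dump
```

Because the restore recreates every index, no stale sort order can survive the move.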

How Sorting Works

Here's an example. On glibc 2.28 or later, run this query and note how the data is sorted:

test=# SELECT * FROM (values ('a'), ('$a'), ('a$'), ('b'), ('$b'), ('b$'), ('A'), ('B')) AS l(x) ORDER BY x ;
 x  
----
 a
 $a
 a$
 A
 b
 $b
 b$
 B
(8 rows)
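To see how strongly the result depends on the collation, compare with the C locale, which sorts by raw byte value: punctuation comes first and every uppercase letter precedes every lowercase one. The sort utility goes through the same glibc comparison machinery, so it serves as a quick stand-in for the query above:

```shell
# Sort the same eight values as the query above, but under the C locale
# (plain byte order) instead of en_US.UTF-8.
printf '%s\n' 'a' '$a' 'a$' 'b' '$b' 'b$' 'A' 'B' | LC_ALL=C sort
```

The output is $a, $b, A, B, a, a$, b, b$ — a completely different order from the linguistic sort shown above, which is exactly why an index built under one collation cannot be trusted under another.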

glibc

glibc (the GNU C library) is the main C library on most Linux systems, and many Linux programs, including Postgres, link against it. It provides many basic operations, and Postgres relies on it to sort text and compare values when building an index.

The glibc 2.28 release from 2018 brought its localization and collation data into compliance with the 2016 fourth edition of the ISO 14651 standard. Because of these changes, indexes built under an older collation version can effectively become corrupted when read under the newer one. If there is a mismatch, the affected indexes must be rebuilt to avoid problems.

How to query the collation

The collation each database is using can be found in the datcollate column of pg_database:

test=# SELECT datname, datcollate FROM pg_database;
  datname  | datcollate  
-----------+-------------
 test      | en_US.UTF-8
 security  | en_US.UTF-8
 template1 | en_US.UTF-8
 template0 | en_US.UTF-8
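Since Postgres 10, the collation version in effect when a collation was first used is also recorded, and can be compared against what the OS currently provides; a mismatch after an upgrade is a warning sign. A sketch, assuming a reachable server and the database name test used throughout this article:

```shell
# Recorded collation version vs. the version the OS reports right now;
# rows where the two columns differ point at indexes that need a rebuild.
psql -d test -c "SELECT collname, collversion,
                        pg_collation_actual_version(oid)
                 FROM pg_collation
                 WHERE collversion IS NOT NULL;"
```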

Check the version of glibc

ldd --version   # print the glibc version

ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
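On glibc-based systems, getconf offers a script-friendly alternative to ldd (on non-glibc libcs such as musl this variable is not defined, so treat it as a glibc-only check):

```shell
# Ask the C library for its own version string, e.g. "glibc 2.17".
getconf GNU_LIBC_VERSION
```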

Repair method

  • Fix during migration

Since this problem arises when binary data is moved across glibc versions and operating systems, it typically shows up during migrations. Migrating via a logical copy or logical backup (e.g., pg_dump) eliminates the problem, because all affected indexes are recreated during the restore. Switching to a logical restore of the database is therefore an effective fix.

For large databases (say, over 100 GB), a logical backup migration may take too long. In those cases, rebuilding the affected indexes after a WAL-based migration is usually the preferred method, since it minimizes downtime and effort.

  • Check a live database

If you suspect the migrated collation may be a problem, the amcheck extension can help identify inconsistencies.

As a side note: amcheck is resource-intensive to run, both in I/O and in time. If the database is large, or you don't want to affect production, consider running the checks on a physical replica: corruption introduced by a collation change is detectable on the replica, so the production workload is not interrupted.
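One way to drive amcheck is a sketch like the following, which checks every btree index in a database. The database name test matches the prompt used earlier; the connection details are assumptions to adjust for your environment:

```shell
# Install the extension (ships with Postgres as a contrib module).
psql -d test -c "CREATE EXTENSION IF NOT EXISTS amcheck;"

# List every btree index (including catalog indexes) and verify each one;
# bt_index_check raises an error if an index is inconsistent with its heap.
psql -d test -Atc "SELECT oid::regclass
                   FROM pg_class
                   WHERE relkind = 'i'
                     AND relam = (SELECT oid FROM pg_am WHERE amname = 'btree');" |
while read -r idx; do
  psql -d test -c "SELECT bt_index_check('$idx');"
done
```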

  • Rebuild the index

If the steps above reveal problems, run REINDEX or REINDEX CONCURRENTLY. (Note: on Postgres 14, use version 14.4 or later so that REINDEX CONCURRENTLY rebuilds the index correctly online and does not itself introduce further corruption.)
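A rebuild can look like this; the index name idx_users_email is illustrative, and REINDEX ... CONCURRENTLY requires Postgres 12 or later:

```shell
# Rebuild one suspect index without blocking concurrent writes.
psql -d test -c "REINDEX INDEX CONCURRENTLY idx_users_email;"

# Or, after an OS upgrade, rebuild every index in the database in one pass
# (takes exclusive locks; schedule a maintenance window).
psql -d test -c "REINDEX DATABASE test;"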


Origin blog.csdn.net/hezhou876887962/article/details/129431205