PolarDB-X partition key column type change

Background

Over decades of development in the database field, one important reason relational databases have stood out is that they let users flexibly define and modify their data models.

As a cloud-native relational database, PolarDB-X also supports modifying the data model through various DDL statements as the user's business evolves. For example, the ALTER TABLE statement can add columns, drop columns, and change column types. As a distributed database, however, PolarDB-X usually splits the data of a logical table into multiple shards (also called physical tables) according to some partitioning method, and these shards are distributed across different data nodes [1][2], which makes implementing DDL statements more complicated.

This article takes column type change as an example to briefly introduce how PolarDB-X executes the ALTER TABLE statement. Column type changes fall into two categories: changing the type of a partition key column, and changing the type of a non-partition-key column. For non-partition-key columns, the logical DDL can simply be split into multiple physical DDLs and pushed down to the corresponding shards for execution. Changing the type of a partition key column is more complicated: besides modifying the column type, the data must also be redistributed, because the partition key's type affects shard routing. If such a change were simply pushed down, queries on the partition key would fail to find the data.

In fact, in a distributed database, changing a table's column type must keep all shards and the metadata consistent, whether or not the column is a partition key. So even a non-partition-key column type change is not just a matter of pushing the statement down; a later article will explain that case in detail. This article focuses on changing the type of a partition key column.

Traditional implementation

Traditional distributed database middleware splits a table across databases and tables and usually does not allow changing the type of the split key column. To make such a change, you generally need to create a new table, stop writes, and re-import the data.

To keep accepting writes during the change, you must maintain your own double-write logic while importing the existing data. This is not only complicated to operate but also makes it hard to verify the correctness of the final data, which easily leads to data inconsistency.

Implementation

A previous article introduced how PolarDB-X implements splitting rule changes [3]; that change process also re-shards the data.

As a classic case of data redistribution, a splitting rule change goes through steps such as creating a new table, double writing, importing existing data, verifying the data, and switching traffic, and the whole process is very mature. Naturally, changing the partition key column type can be built by modifying this process. The detailed implementation is introduced below.

Incremental double writing, existing-data synchronization, and traffic switching during data redistribution are described in detail in the splitting rule change article [3], so they are not repeated here. Readers who have not read it are strongly encouraged to read that article together with the Online Schema Change paper [4].

One addition: we have also implemented physical data verification based on TSO snapshots [5] to ensure data correctness before and after the change. Returning to the topic of this article, changing the partition key column type differs from changing the splitting rules in the following respects, each explained in detail below.

  • Creating the new table
  • Global secondary index (GSI) handling
  • Data verification

Creating the new table

A splitting rule change modifies only the partitioning rules, not the column definitions, so the newly created table has exactly the same structure as the original table. Changing the partition key column type, however, modifies the definition of the partition key column, so the new table's structure is not fully identical to the original: apart from the partition key column definition, everything else is the same.

Because the new table's column definitions differ from the original table's, the existing incremental double-write and data synchronization processes involve implicit type conversion. Do these two processes need to be modified? No: for the partition key, the implicit type conversion on the DN (data node) is compatible with the CN (compute node), which guarantees that data before and after implicit conversion is routed to the same shard, so routing is not a concern.

In addition, readers familiar with MySQL may know that the conversion performed by ALTER TABLE MODIFY COLUMN differs from the implicit conversion logic of DML, which can make data synchronized downstream through the binlog inconsistent with the upstream. This problem is addressed in the data verification section.
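A commonly cited MySQL example of this divergence is the string '1.5' (illustrative; assumes MySQL's default, non-strict conversion behavior):

```sql
CREATE TABLE t (v varchar(30));
INSERT INTO t VALUES ('1.5');

-- DML-style conversion (CAST) stops at the decimal point:
SELECT CAST(v AS SIGNED) FROM t;   -- 1, with a truncation warning

-- ALTER TABLE MODIFY COLUMN converts through the numeric path and rounds:
ALTER TABLE t MODIFY COLUMN v int;
SELECT v FROM t;                   -- 2
```

If a downstream consumer replays the change using CAST-style conversion while the upstream ran ALTER TABLE, the two sides end up holding 1 and 2 for the same row.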

Global Secondary Index Handling

An earlier article introduced PolarDB-X's global secondary index (GSI) [6]. To make index-lookback queries efficient, a GSI includes the primary key and partition key of the main table as covering columns by default.

To keep GSI and main-table data consistent, changing the type of a main-table partition key column also requires changing the corresponding covering column type in every GSI at the same time, so changing the main table's partition key column type in fact redistributes the GSI tables' data as well.
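As a sketch (hypothetical table and index names; GSI syntax abbreviated), a main-table partition key change touches every GSI that covers that key:

```sql
CREATE TABLE orders (
  id bigint NOT NULL,
  buyer_id int NOT NULL,
  amount decimal(10, 2),
  PRIMARY KEY (id),
  -- GSI partitioned by buyer_id; it implicitly covers the primary key
  -- and the main table's partition key (here both are `id`).
  GLOBAL INDEX g_buyer (buyer_id) PARTITION BY KEY (buyer_id)
) PARTITION BY KEY (id);

-- Changing the main table's partition key type also rewrites g_buyer,
-- because its covering copy of `id` must change type as well.
ALTER TABLE orders MODIFY COLUMN id varchar(30) NOT NULL;
```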

If the GSI's partition key differs from the main table's and the type of the GSI's partition key column is changed, the same process must still be followed to guarantee data consistency.

Data verification

To ensure the correctness of the change, a data verification step is inserted after creating the new table, enabling incremental double writing, and synchronizing the existing data. Only after verification passes are traffic switched and the original table gracefully taken offline.

Briefly, verification works as follows. We first implemented an order-independent hash algorithm on the DN side and wrapped it as a UDF. When the CN starts verification, it uses a TSO transaction to obtain consistent snapshots of the source and target tables, then runs a whole-table hashcheck computation on the DN side, in parallel, for each shard of the source and target tables. The per-shard results are pulled to the CN node and aggregated, and finally the checksum of the source table is compared with the checksum of the target table.
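The actual hashcheck is a UDF on the DN, but its spirit can be approximated in plain MySQL with an order-independent aggregate (illustrative only; `sbtest1_phy_00` is a hypothetical physical shard name, not the real function):

```sql
-- BIT_XOR makes the checksum independent of row order, so each shard
-- can be scanned in parallel and the per-shard results combined on the CN.
SELECT BIT_XOR(CRC32(CONCAT_WS(',', id, k, c, pad))) AS shard_checksum
FROM sbtest1_phy_00;
```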

[Figure] Hashcheck flow for isomorphic tables (parallel across shards).

[Figure] Hashcheck flow for heterogeneous tables (parallel across shards).

For a partition key column type change, the source and target tables' column definitions are not fully consistent, so verifying them directly is bound to fail. For example, suppose the source table's partition key column type is VARCHAR and one row holds '123abc'; after changing the column type to INT, the corresponding value in the target table is converted to 123, and the hashchecks of 123 and '123abc' naturally differ, so verification fails.

To solve this, after the new table is created, a virtual column (invisible to users) used only for data verification is added to the source table; it applies the column type conversion function on top of the partition key column. In the example above, the virtual column computed from '123abc' holds the conversion function's result, 123, which matches the target table, so verification can be completed through the virtual column. Because the conversion function the virtual column calls is consistent with MySQL's ALTER TABLE MODIFY COLUMN conversion logic, this also detects the inconsistency between DML implicit conversion and ALTER TABLE MODIFY COLUMN conversion.
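A sketch of the idea (illustrative syntax; the real implementation uses a hidden column and a conversion function that matches ALTER TABLE MODIFY COLUMN semantics, which CAST only approximates):

```sql
-- Add a virtual verification column on the source table that pre-applies
-- the type conversion to the partition key column.
ALTER TABLE sbtest1
  ADD COLUMN _chk_id int GENERATED ALWAYS AS (CAST(id AS SIGNED)) VIRTUAL;

-- Verification then compares the hashcheck over (_chk_id, ...) on the
-- source table with the hashcheck over (id, ...) on the target table.
```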

View DDL execution plan

Changing a partition key column's type does not always require data redistribution. For a string column, for example, merely increasing the length without changing the CHARSET or COLLATE does not actually require redistribution; the execution process is then similar to a non-partition-key column type change. Conversely, a user may change only the partition key column type of a GSI rather than the main table's partition key, which still causes data redistribution. There are thus several scenarios for column type changes. To help users quickly tell which process a change will take, we provide an explain-like operation for DDL statements. The sysbench table is used below to give a few examples.

Create table statement:
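The examples below assume a standard sysbench-style table partitioned by its primary key, along the lines of (illustrative):

```sql
CREATE TABLE sbtest1 (
  id int NOT NULL,
  k int NOT NULL DEFAULT '0',
  c char(120) NOT NULL DEFAULT '',
  pad char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k_1 (k)
) PARTITION BY KEY (id);
```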

Example 1, modify the non-partition key column type:
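A hedged sketch of such a statement (column name and plan shape are illustrative, not actual PolarDB-X output); a non-partition-key change is expected to push down directly:

```sql
EXPLAIN ALTER TABLE sbtest1 MODIFY COLUMN c varchar(200) NOT NULL DEFAULT '';
-- Expected shape of the plan: only the physical
-- "ALTER TABLE ... MODIFY COLUMN" statements pushed down to each shard.
```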

Example 2, modify the partition key column type and require data redistribution:
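Again a hedged sketch (plan shape is illustrative, not actual PolarDB-X output); a partition key type change produces a multi-step plan:

```sql
EXPLAIN ALTER TABLE sbtest1 MODIFY COLUMN id varchar(30) NOT NULL;
-- Expected shape of the plan (abridged):
--   CREATE TABLE ...  -- build the new table with the changed key type
--   ALTER TABLE ...   -- add the virtual column used for verification
--   DROP TABLE ...    -- remove the old table after verification/cutover
```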

In the plan, CREATE TABLE creates the new table, DROP TABLE deletes the old table after verification completes, and ALTER TABLE adds the virtual column used for data verification.

Example 3, modify the partition key column type without data redistribution:

First change the partition key column id to varchar(30), which requires data redistribution, and then change its type to varchar(60). The explain result is as follows; note that no data redistribution is needed (no tables are created or deleted).
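A hedged sketch of the two statements (plan shape is illustrative, not actual PolarDB-X output):

```sql
-- Step 1: int -> varchar(30) changes routing, so data is redistributed.
ALTER TABLE sbtest1 MODIFY COLUMN id varchar(30) NOT NULL;

-- Step 2: enlarging the length with the same CHARSET/COLLATE keeps
-- routing unchanged, so the plan contains no CREATE TABLE / DROP TABLE,
-- only pushed-down physical ALTER TABLE statements and a metadata update.
EXPLAIN ALTER TABLE sbtest1 MODIFY COLUMN id varchar(60) NOT NULL;
```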

Summary

Flexible column type changes are an important feature of a distributed database. PolarDB-X supports changing the partition key column type while ensuring strong data consistency, high availability, and transparency to the business, removing the restrictions imposed by the distributed architecture and making the feature very convenient to use. Building on PolarDB-X's splitting rule change, this article briefly described the technical points involved in implementing partition key column type changes. PolarDB-X can support this feature thanks to many underlying capabilities such as TSO transactions, which is also one of the important features distinguishing a distributed database from distributed database middleware.

Author: Wu Mu

Click to try cloud products for free and start your hands-on journey on the cloud!

Original link

This article is the original content of Alibaba Cloud and may not be reproduced without permission.

Origin blog.csdn.net/yunqiinsight/article/details/131829278