Database Localization in Practice: Application Transformation

In recent years, with the steady advance of localization and the maturing of distributed database technology, more and more enterprises have chosen domestic databases to replace their incumbent ones. Migrating a core database, however, is a nerve-racking IT operation: one careless step can cause damage on the scale of the infamous "drop the database and run" disaster.

Because domestic databases mostly adopt distributed architectures that differ substantially from what they replace, enterprises run into all kinds of problems during localization, including but not limited to architectural differences, data incompatibility, syntax differences, and performance regressions, which can turn database migration into a grind.

Application Transformation, Demystified

Database localization does not change anything at the business level; its essence is resolving the differences between databases. The differences between domestic and traditional databases come down to the following aspects:

  • Architectural differences: the architectures are planned differently, so the data model design must account for these differences and keep data storage and distribution as reasonable as possible.
  • Functional differences: syntax and functions differ, so the transformation must watch for these differences to avoid syntax errors, missing functionality, or inconsistent data.
  • Performance differences: performance characteristics may also differ, so the transformation must ensure, as far as possible, that the migrated database system remains fast and stable.
  • Other aspects: security, scalability, maintainability, and similar concerns also need attention.

Our Approach in Practice

Facing these differences, application transformation has to be considered holistically, with adjustments and optimizations that ensure the transformed system meets business needs while remaining performant, stable, and secure. Combining this with our R&D management process, we adopted a "3-stage, 11-step" solution to carry out the localized database migration and application transformation:

  • Planning and design: based on the characteristics of the selected database, re-plan the table structures by adjusting table types, sharding keys, sharding algorithms, and so on, and design the most reasonable model.
  • Coding and development: itemize the database differences and modify the code for interface adaptation, data types, syntax compatibility, and function compatibility, so that business functionality is preserved.
  • Performance tuning: following the database optimization model, tune the database and SQL across the driver, network, proxy, and node layers to ensure performance and reliability.

Distributed Data Model Design

Domestic databases generally adopt a distributed architecture, and compared with traditional centralized databases, their data model design adds three steps: sharding design, distribution design, and object optimization. Application transformation must complete all three.

Data Sharding Design

Before designing the sharding scheme, it helps to understand the typical domestic database architecture, which consists of three layers: computing nodes, data nodes, and global transaction management. Another key concept is that distributed databases store data spread across shards; GoldenDB shards, for example, are stored in MySQL instances. This leads to the single most important design task: shard the data correctly so that the advantages of the distributed architecture are fully exploited.

To satisfy the key shard-key selection principles of uniqueness, uniformity, and high data affinity, there are three guidelines:

  • Prefer columns with many distinct values, so that the data is evenly distributed. Uniform distribution avoids the weakest-link (barrel) effect and lets every host carry an equal share of the work.

  • Prefer join columns or GROUP BY columns, to avoid data movement between data nodes and improve performance.

  • Prefer columns that appear as query predicates, to reduce both the intermediate result sets flowing back to the CN (computing node) and the SQL splitting work the CN must perform.

After selecting the sharding key, a sharding algorithm must be chosen. Different distributed databases support LIST, RANGE, HASH, composite sharding, and so on. The choice is closely tied to the business scenario and must be made case by case; here are a few common scenarios.

  • If splitting by a time field, RANGE or LIST is usually the best choice.

  • Batch scenarios mainly involve range queries and bulk writes. If range-query performance matters most, use RANGE, accepting that write pressure across shards may be unbalanced; if bulk-write performance matters most, use HASH, accepting that range queries will scan all shards and run more slowly.
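As an illustration of this trade-off, here is a minimal, vendor-neutral sketch of HASH routing versus RANGE routing by month. The class and method names are ours for illustration, not any product's API:

```java
import java.time.LocalDate;

// Illustrative sketch (not vendor code): routing a row to a shard by
// HASH of the key versus by RANGE over a time field.
public class ShardRouting {

    // HASH routing: distributes writes evenly, but a range scan
    // must touch every shard.
    static int hashShard(String key, int shardCount) {
        // floorMod guards against negative hashCode values
        return Math.floorMod(key.hashCode(), shardCount);
    }

    // RANGE routing by month (12 shards, one per month): range queries
    // hit few shards, but writes concentrate on the current period.
    static int rangeShardByMonth(LocalDate date) {
        return date.getMonthValue() - 1;
    }

    public static void main(String[] args) {
        System.out.println(hashShard("ORDER-20230815-0001", 4));
        System.out.println(rangeShardByMonth(LocalDate.of(2023, 8, 15)));
    }
}
```

A row keyed by an August date always lands on the same RANGE shard, which is exactly why bulk writes skew under RANGE while HASH spreads them out.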

Data Distribution Design

The distribution design of database tables mainly considers two aspects: redundant placement of data and striped placement across locations:

  1. Redundant placement: map one copy of the data onto multiple nodes; such tables are called replicated tables. Replication trades space for performance. During design, analyze table type, row count, access frequency, and join complexity, and make frequently accessed tables, configuration tables, and common join-driving tables replicated tables.
  2. Striped placement: for the tables in a distributed database, try to settle on one unified sharding key by which most tables can scatter their data, so that subsequent business access completes as a closed-loop operation within a single shard, without crossing shards. During design, analyze high-frequency SQL and slow SQL, and keep adjusting the tables' sharding keys until the data is striped as intended.

Object Optimization Design

Object optimization in a distributed architecture is mainly about index design, which must exploit the linear scalability of the architecture. There are two main kinds of adjustment.

Primary key selection: many centralized databases use auto-increment sequence numbers as primary keys, but their generation performs poorly and is not very safe in a distributed architecture, so they are unsuitable. A globally unique key is recommended instead: an ordered UUID automatically generated by MySQL, a globally unique key generated by the business, or an open-source ID generation algorithm such as snowflake (which, however, suffers from the clock-rollback problem).
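The snowflake approach mentioned above can be sketched as follows. The 41/10/12-bit layout is the commonly cited Twitter layout and the epoch is an arbitrary assumption; a production generator would also need a real strategy for the clock-rollback problem rather than simply failing, as this sketch does:

```java
// Minimal Snowflake-style ID generator sketch (common 41/10/12-bit layout).
// The epoch is an arbitrary assumption; clock rollback is only detected,
// not handled, so this is illustrative rather than production-ready.
public class Snowflake {
    private static final long EPOCH = 1640995200000L; // 2022-01-01 UTC, assumed
    private final long workerId;   // 10 bits: 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;    // 12 bits: 0..4095 per millisecond

    public Snowflake(long workerId) {
        if (workerId < 0 || workerId > 1023)
            throw new IllegalArgumentException("workerId out of range");
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts < lastTimestamp)
            throw new IllegalStateException("clock moved backwards");
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;   // wrap within the same millisecond
            if (sequence == 0) {                 // sequence exhausted: wait for next ms
                while ((ts = System.currentTimeMillis()) <= lastTimestamp) { /* spin */ }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = ts;
        return ((ts - EPOCH) << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        Snowflake sf = new Snowflake(1);
        System.out.println(sf.nextId());
        System.out.println(sf.nextId());
    }
}
```

IDs generated this way are monotonically increasing on one worker, which keeps B-tree inserts append-friendly, unlike random UUIDs.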

Ordinary index design: index fields often do not contain the sharding key, so using the index requires querying every shard, which performs poorly. Two designs help: 1) make the indexed field the sharding key of a separate index table; 2) append sharding-key information to the index.
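Design 1, the index table, can be illustrated with a toy in-memory model. Everything here is hypothetical and for illustration only; real systems implement this inside the database:

```java
import java.util.*;

// Toy model of the "index table" technique: orders are sharded by order ID,
// and a separate index table (itself sharded by customer ID) maps a customer
// back to order IDs, so a customer lookup reads one index shard instead of
// scanning every order shard. Illustrative only, not vendor code.
public class IndexTable {
    static final int SHARDS = 4;
    // order shards: orderId -> customerId, sharded by orderId
    static final List<Map<Integer, String>> orderShards = new ArrayList<>();
    // index table: customerId -> orderIds, sharded by customerId
    static final List<Map<String, List<Integer>>> indexShards = new ArrayList<>();
    static {
        for (int i = 0; i < SHARDS; i++) {
            orderShards.add(new HashMap<>());
            indexShards.add(new HashMap<>());
        }
    }

    static void insertOrder(int orderId, String customerId) {
        orderShards.get(Math.floorMod(orderId, SHARDS)).put(orderId, customerId);
        indexShards.get(Math.floorMod(customerId.hashCode(), SHARDS))
                   .computeIfAbsent(customerId, k -> new ArrayList<>()).add(orderId);
    }

    // Reads a single index shard; no scatter query across order shards.
    static List<Integer> ordersByCustomer(String customerId) {
        return indexShards.get(Math.floorMod(customerId.hashCode(), SHARDS))
                          .getOrDefault(customerId, Collections.emptyList());
    }

    public static void main(String[] args) {
        insertOrder(1001, "C-42");
        insertOrder(1002, "C-42");
        insertOrder(1003, "C-7");
        System.out.println(ordersByCustomer("C-42")); // [1001, 1002]
    }
}
```

The cost, of course, is an extra write per insert to keep the index table consistent, which is why such tables are reserved for genuinely hot query paths.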

Four Code Compatibility Modifications

With the data model adjusted, the code is modified for compatibility next. This has two dimensions: broad adaptation across database objects, and deep mining of the differences and transformation points. On the breadth side, the code must be comprehensively reworked in four areas: database interfaces, data types, syntax, and functions.

Interface Adaptation

Java applications usually connect to the database over JDBC. Besides replacing the driver, the distributed database's connection and high-availability parameters must also be configured to take full advantage of it. The following driver protocol attributes are commonly adjusted when migrating to GoldenDB.

Data Type Conversion

Data type conversion covers field types, precision, length, and character sets, and many mature third-party or vendor tools can handle it, so it is not detailed here. What does deserve attention are string-length semantics and character sets: Oracle measures string length in bytes while MySQL (GoldenDB/OB) measures it in characters, and for the UTF8 character set Oracle defaults to utf8mb3 while MySQL (GoldenDB/OB) defaults to utf8mb4. Both differences require length conversion. The following shows the precision, length, and character set support of GoldenDB field types.
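The byte-versus-character difference is easy to demonstrate: a two-character CJK string occupies six bytes in UTF-8, so a column sized as Oracle VARCHAR2(6 BYTE) corresponds to MySQL VARCHAR(2). A quick check in Java:

```java
import java.nio.charset.StandardCharsets;

// Shows why Oracle byte-based lengths and MySQL character-based lengths
// diverge for multi-byte text.
public class LengthCheck {
    public static void main(String[] args) {
        String s = "数据";                                          // 2 CJK characters
        System.out.println(s.length());                             // character count: 2
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // UTF-8 bytes: 6
    }
}
```

Any automated length conversion must therefore know which measurement each side uses, or migrated data may be silently truncated or rejected.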

Syntax Compatibility Modification

Mainstream distributed databases are mostly based on PG (openGauss) or MySQL (OB, GoldenDB), so a syntax-compatibility gap analysis between Oracle, PG, and MySQL is needed (distributed database vendors do adapt away some of these differences, which is outside the scope of this discussion); the incompatible items are then developed and modified against that map. The main differences between the three are as follows:

Function Compatibility Modification

As with syntax compatibility, beyond what the vendor adapts, the function differences between Oracle, PG, and MySQL must also be modified. The common function differences are as follows:

Field NULL values deserve special attention: NULL is a special value in databases, and its behavior can differ between them. For example, GoldenDB places the following restrictions on NULL:

  • Nullable fields cannot be used as global indexes or sharding keys.

  • NULL is not equal to any value (including TRUE, FALSE, or NULL), and relational operators (greater than, less than, equal, not equal) cannot be applied to it.

  • NULL can be tested with IS NULL or IS NOT NULL, and converted into an ordinary value with functions such as NVL and NVL2.

  • In a NOT IN subquery, if the subquery's result set contains a NULL, the entire NOT IN condition never evaluates to true, so no rows are returned.
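The NOT IN behavior follows from SQL's three-valued logic: `x NOT IN (a, b, NULL)` expands to `x <> a AND x <> b AND x <> NULL`, and the last comparison is UNKNOWN, so the predicate can never be TRUE. A toy simulation, using Java's `Boolean` with `null` standing in for SQL UNKNOWN (the class and helpers are ours, purely for illustration):

```java
import java.util.*;

// Toy simulation of SQL three-valued logic. Java's Boolean null stands in
// for SQL UNKNOWN; shows why NOT IN filters out every row when the
// subquery result contains a NULL.
public class ThreeValuedLogic {
    // SQL equality: any comparison with NULL yields UNKNOWN (null here)
    static Boolean sqlEquals(Integer a, Integer b) {
        return (a == null || b == null) ? null : a.equals(b);
    }
    static Boolean not(Boolean v) { return v == null ? null : !v; }
    static Boolean and(Boolean a, Boolean b) {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) return false;
        if (a == null || b == null) return null;
        return true;
    }
    // x NOT IN (list)  ==  AND over all items of (x <> item)
    static Boolean notIn(Integer x, List<Integer> list) {
        Boolean result = true;
        for (Integer item : list) result = and(result, not(sqlEquals(x, item)));
        return result;
    }
    public static void main(String[] args) {
        System.out.println(notIn(3, Arrays.asList(1, 2)));       // true: row kept
        System.out.println(notIn(3, Arrays.asList(1, 2, null))); // null (UNKNOWN): row filtered
    }
}
```

The standard fix is rewriting NOT IN as NOT EXISTS, whose semantics ignore NULLs in the subquery.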

Having analyzed the differences across the code in breadth, how do we dig out the specific SQL statements that differ? Two capture-and-replay methods can be used to collect database incompatibilities.

Method 1: use the MyBatis plug-in mechanism to write a custom plug-in that hooks into the SQL prepare stage, obtains the SQL statement, and writes it out, thereby recording the application's SQL.

Method 2: use Oracle's WCR tool to record the SQL executed in the database. This captures all SQL but cannot tell which application issued it, so the captured SQL must be filtered to remove operations-and-maintenance statements, Oracle-internal activity, and other non-application SQL.

WCR playback process

Finally, convert the captured SQL into the target database's syntax and replay it to detect incompatible items, forming a standardized list of modification items that developers and testers can work through.

Database tiered performance optimization

Before diving into database optimization, consider the path a SQL statement travels during execution, which reveals the factors that can affect its speed.

In most distributed databases the execution path is application -> driver -> Proxy -> data node: SQL is issued by the application through the driver, parsed and split by the Proxy, and routed to the data nodes; the data nodes execute it and return results to the Proxy for aggregation, or directly to the application. From this flow, several factors that affect SQL execution speed emerge:

  • The driver's connection performance.

  • Network transmission between the application and the proxy.

  • Aggregation processing capability of Proxy.

  • The SQL execution efficiency of the data node itself.

Driver layer optimization

The driver is the bridge and connection entry point between the application and the database, and its parameters are often left untuned when it is introduced. Optimizing driver parameters such as database connections, SQL caching, and load balancing can improve SQL execution performance. The GoldenDB driver, for example, can be tuned through the following parameters:
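GoldenDB's own parameter names belong in the vendor manual and are not reproduced here. As a hedged, MySQL-compatible illustration, these are the standard MySQL Connector/J properties for statement caching that typically matter in this kind of tuning:

```java
import java.util.Properties;

// Illustrative JDBC tuning properties. These are standard MySQL
// Connector/J property names; GoldenDB's actual driver parameters should
// be taken from the vendor documentation. Sketch only.
public class DriverTuning {
    static Properties tuned() {
        Properties p = new Properties();
        p.setProperty("cachePrepStmts", "true");        // client-side prepared-statement cache
        p.setProperty("prepStmtCacheSize", "250");      // cache up to 250 statements
        p.setProperty("prepStmtCacheSqlLimit", "2048"); // cache statements up to 2 KB of SQL
        p.setProperty("useServerPrepStmts", "true");    // use real server-side prepares
        return p;
    }

    public static void main(String[] args) {
        // DriverManager.getConnection(url, tuned()) would apply these;
        // omitted here since no database is available.
        System.out.println(tuned().getProperty("cachePrepStmts"));
    }
}
```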

Network layer optimization

Network transmission between the application and the Proxy is another important concern; problems can stem from the network itself, from the data-fetching mechanism, or from result sets that are too large. Below are some pitfalls we hit in the GoldenDB transformation, and their solutions:

Proxy layer optimization

The Proxy is mainly responsible for aggregating the data obtained from each DN and performing secondary joins on it, which involves an operation unique to distributed databases: the "distributed JOIN".

Distributed JOINs come in many forms, which fall into two broad categories:

JOINs that can be pushed down: the JOIN is executed on the data nodes and the Proxy merely merges the results. Because storage and computation stay together, this is efficient; examples are same-dimension JOINs and replicated-table JOINs.

JOIN of the same dimension

Copy table JOIN

JOINs that cannot be pushed down: each data node runs a single-table query and the Proxy performs the JOIN itself. This moves large amounts of data and is inefficient; the Nested Loop Join is the typical case.

Nested Loop Join
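Why same-dimension JOINs can be pushed down: when both tables are sharded by the join key, the union of the per-shard join results equals the global join, leaving the Proxy nothing to do but concatenate. A toy check of that equivalence (illustrative only, all names ours):

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy demonstration that a same-dimension JOIN can be pushed down:
// joining within each shard and concatenating equals the global join
// when both tables share the sharding key (column 0 here).
public class PushdownJoin {
    static int shard(int key, int shards) { return Math.floorMod(key, shards); }

    // naive join on column 0 (the shared sharding key)
    static List<String> join(List<int[]> a, List<int[]> b) {
        List<String> out = new ArrayList<>();
        for (int[] ra : a)
            for (int[] rb : b)
                if (ra[0] == rb[0]) out.add(ra[0] + ":" + ra[1] + "," + rb[1]);
        return out;
    }

    static boolean pushdownEqualsGlobal(List<int[]> orders, List<int[]> items, int shards) {
        Set<String> global = new TreeSet<>(join(orders, items));   // centralized result
        Set<String> pushedDown = new TreeSet<>();
        for (int s = 0; s < shards; s++) {                          // per-shard joins
            final int shardNo = s;
            List<int[]> oa = orders.stream()
                .filter(r -> shard(r[0], shards) == shardNo).collect(Collectors.toList());
            List<int[]> ib = items.stream()
                .filter(r -> shard(r[0], shards) == shardNo).collect(Collectors.toList());
            pushedDown.addAll(join(oa, ib));
        }
        return global.equals(pushedDown);
    }

    public static void main(String[] args) {
        List<int[]> orders = List.of(new int[]{1, 10}, new int[]{2, 20}, new int[]{3, 30});
        List<int[]> items  = List.of(new int[]{1, 100}, new int[]{3, 300});
        System.out.println(pushdownEqualsGlobal(orders, items, 2)); // true
    }
}
```

If the two tables were sharded by different keys, matching rows could live on different shards and the per-shard joins would miss them, which is exactly the case that forces the Proxy into a Nested Loop Join.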

When tuning SQL at the Proxy layer, the main goal is to avoid Nested Loop Joins. Below are some optimization lessons we accumulated in the project:

Node layer optimization

A data node is essentially a standalone MySQL or PG instance, and its optimization methods differ little from those of traditional databases, so they are not repeated here.

Epilogue

Database localization is not just about the "domestic" label; it also reflects the broader cloud-computing trend of distributed architectures replacing centralized ones. Reshaping business software around this technological shift requires weaving the technology, characteristics, and principles of distributed architecture into every stage of the transformation: planning and design, code development, and performance optimization. Only then can the advantages of distributed databases be fully exploited, giving the new system capabilities traditional databases cannot match and providing the business with higher performance, stability, and agility. And only with the support of a professional team can customers stand out in the tide of database localization and find some joy in the migration process.

This article distills a few insights from our practice of localized database transformation, in the hope of offering some reference and inspiration to those undertaking it themselves.

Origin blog.csdn.net/whalecloud/article/details/132421196