"Basic concepts" of sub-table sub-database learning (1)

1. Why shard databases and tables?

1.1 When a database performance bottleneck appears

For an application, a database performance problem usually shows up in one of three ways: connections cannot be obtained because the number of connections is insufficient under high concurrency; data operations are slow, meaning the database processes data inefficiently; or there is a storage problem, for example the volume of data on a single machine is too large, which in turn causes performance problems. In the final analysis, we are limited by hardware: CPU, memory, disk, network and so on. But optimization certainly cannot start with simply expanding the hardware, because the cost-benefit ratio is too poor.

1.2 Traditional optimization methods of database

When the database cannot hand out connections or has slowed down, we can optimize at the following levels.

  • Design of tables and fields
    On the premise of satisfying the business logic, design the table structure and fields as reasonably as possible. For details, refer to the Alibaba Java Development Manual.
  • SQL and indexes
    Since SQL statements are written on our application side, the first step is to optimize the SQL in the program, with the ultimate goal of making good use of indexes. This is the easiest and most commonly used optimization.
  • Tables and storage engines
    Data is stored in tables, and tables are stored by the storage engine in different formats, so we can choose a specific storage engine, partition a table, split the table structure (split a large table into small tables), add redundancy, or optimize the table structure, such as field definitions.
  • Database cluster
    If there is only one database server, we can run multiple instances, build a cluster, and add load balancing.
  • Read-write separation
    Based on master-slave replication, read-write separation sends all write requests to the master server and all read requests to the slave servers; the slaves synchronize data from the master automatically.
  • Add a cache
    Add a layer of cache in front of the database to reduce the pressure on the database and improve access speed.
  • Database configuration
    Optimize the database configuration, such as the number of connections and buffer sizes. The purpose of tuning the configuration is to use the hardware more efficiently.
  • Operating system and hardware
    The last step is to optimize the operating system and the hardware.
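As a minimal illustration of the read-write separation bullet above, the router only has to decide whether a statement may go to a replica. The node names "master" and "slave" are illustrative, not taken from any particular framework:

```java
// Sketch of read-write splitting on top of master-slave replication:
// writes go to the master, reads may go to a slave.
public class ReadWriteRouter {
    public static String route(String sql) {
        String s = sql.trim().toLowerCase();
        // Only SELECTs may be served by a replica; everything else must hit the master.
        return s.startsWith("select") ? "slave" : "master";
    }
}
```

A real router would also force reads inside a write transaction to the master to avoid replication lag.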

From top to bottom, the cost of each step rises while the benefit shrinks, so the answer to a slow query is definitely not to pile on hardware right away. Piling on hardware is called scale up.

1.3 The ultimate optimization: sharding databases and tables

To disperse the storage pressure and access pressure on the database service, we can also distribute different data to different service nodes. This is called scale out.

Note the difference between master-slave (replication) and sharding:

  • Master-slave achieves high availability through data redundancy and enables read-write separation.
  • Sharding disperses storage and access pressure by splitting the data.

When do we need to shard, and by what criteria? If the criterion is data volume, how many rows in a table before we should consider sharding? If it is growth rate, how much new data per day? If it is the application's access pattern, how slow must queries become, or how many requests must fail to get a connection, before we shard? These are questions worth thinking about.

2. Architecture evolution and sharding

2.1 Single application single database

Take an early banking system as an example: a typical monolithic application. In a monolithic architecture all the code lives in one project, is packaged as a single war and deployed to Tomcat, and finally runs in one process. After initialization this banking system has hundreds of tables, such as a customer information table, account table, merchant table, product table, loan table, repayment table, and so on.
The structure diagram is as follows:
To keep up with business growth, this system was constantly modified; the amount of code grew and the system became more and more bloated. To optimize it we set up a cluster, added load balancing and caching, and optimized the database and the business code, but still could not cope with the access pressure.

At this point splitting the system became imperative. We split the previous core system into many subsystems, such as a bill of lading system, merchant management system, credit review system, contract system, withholding system, and collection system. All subsystems still shared one database.

2.2 Multi-application single database

The code is decoupled and responsibilities are split, so problems in the production environment can be located and resolved quickly.

But this architecture, where multiple subsystems share one DB, causes problems.

First, with all business systems sharing one DB, neither performance nor storage demands can be met. As the business keeps expanding we add more systems that access the database, but the concurrency a single physical database can support is limited; competition among the business systems will eventually degrade application performance and may even bring the business systems down.

2.3 Multi-application independent database

So at this point we must also split each subsystem's database. Each business system then has its own database, and different business systems can use different storage solutions.

Splitting databases is therefore the natural result of splitting the system to solve its performance problems. Today's microservice architecture is the same: splitting only the application without splitting the database does not solve the fundamental problem.

2.4 When do we split tables?

After we split the original database, the data in some tables still grew very fast, and query efficiency dropped noticeably, so after splitting databases we need to further split tables. Typically we first split the data within one database (by partitioning or table splitting), and then split it across multiple databases. Table splitting mainly reduces the size of a single table, solving the performance problems caused by the data volume of one table.

Be clear that although sharding improves the system's availability, it also increases its complexity. If you do not need to solve storage or performance problems now or in the foreseeable future, do not design ahead of need or over-design. It is like starting a project: for rapid delivery we begin with a single codebase, and a microservice architecture is unnecessary before the business is rich and mature. If the table structure is reasonable, there are not too many fields, and the indexes are created correctly, a single table can hold tens of millions of rows without problems. This depends on the actual situation of the application, though of course we also forecast future business growth.

3. Types of sharding

In terms of dimensions there are two types: vertical and horizontal.

  • Vertical splitting: divides by table or field, so the resulting table structures differ. There are single-database splits (table splitting) and multi-database splits (database splitting).
  • Horizontal splitting: divides the data itself; the table structures are identical but the rows differ. There are likewise single-database and multi-database variants.


3.1 Vertical splitting

There are two kinds of vertical splits:

  • Single-database vertical split
    Split one table into several within the same database. For example, a merchant information table can be split into a basic information table, contact table, settlement information table, attachment table, and so on.
  • Multi-database vertical split
    Move the tables that originally lived in one database into different databases.


3.2 Horizontal splitting

Even after we split the original tables into separate databases, some business data may still grow very fast. For example, the repayment history table in the repayment database may reach hundreds of millions of rows, and performance problems caused by hardware limits will reappear. From this perspective, vertical splitting does not fundamentally solve the problem of too much data in a single database and single table, so we also need to split the data horizontally.

When the customer table reaches tens of millions or even hundreds of millions of rows, both the storage capacity and the query efficiency of a single table become problems, and we need to split the data of that single table horizontally. In a horizontal split, the table structure in every database is the same, but the data stored differs; for example, each database might hold 10 million rows.

Horizontal splitting also comes in two kinds:

  • Single-database horizontal split
    Take a bank's transaction flow table: every incoming and outgoing transaction is registered in it, and customers mostly query the current day's transactions or those within one month, so we split the table into three by frequency of use:
    • Today table: stores only the current day's data.
    • Current-month table: a scheduled task runs at night and migrates the previous day's data into it, using insert into ... select followed by a delete.
    • History table: another scheduled task migrates data registered more than 30 days ago into the history table (its data volume is very large, so we create partitions by month).

Note that, like partitioning, this approach solves single-table query performance to some extent, but it cannot solve the single-machine storage bottleneck.
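The today/month/history scheme above boils down to a date-based routing rule. A minimal sketch, in which the table names trade_today, trade_month and trade_history are hypothetical:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Routing rule behind the today/month/history split of a transaction
// flow table. Table names are illustrative, not from the article.
public class TradeTableRouter {
    // Returns the physical table that holds a transaction registered
    // on tradeDate, evaluated as of "today".
    public static String route(LocalDate tradeDate, LocalDate today) {
        long daysAgo = ChronoUnit.DAYS.between(tradeDate, today);
        if (daysAgo <= 0) {
            return "trade_today";    // current day's transactions
        } else if (daysAgo <= 30) {
            return "trade_month";    // within the last 30 days
        } else {
            return "trade_history";  // older than 30 days
        }
    }
}
```

The nightly migration job is what keeps the physical data consistent with this rule.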

  • Multi-database horizontal split
    Split one table, such as the customer table, across multiple databases; the table structure is exactly the same in each.
    Generally speaking, when people talk about sharding they mean this cross-database table split.
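A multi-database horizontal split needs a routing rule that maps a shard key to a physical database and table. A minimal sketch, assuming 4 databases with 2 customer tables each (illustrative numbers, not a recommendation from the article):

```java
// Hash-modulo routing for a horizontally split customer table.
public class CustomerShardRouter {
    static final int DB_COUNT = 4;
    static final int TABLES_PER_DB = 2;

    // Which database holds this customer id.
    public static int dbIndex(long customerId) {
        return (int) (customerId % DB_COUNT);
    }

    // Which table inside that database holds it.
    public static int tableIndex(long customerId) {
        return (int) (customerId / DB_COUNT % TABLES_PER_DB);
    }

    // Physical location, e.g. "db_1.customer_0".
    public static String physicalName(long customerId) {
        return "db_" + dbIndex(customerId) + ".customer_" + tableIndex(customerId);
    }
}
```

Every query that carries the shard key (customerId) can then be sent to exactly one node; queries without it must be broadcast to all nodes.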

Since sharding can solve our performance problems, should we do it right away, perhaps even splitting into several databases at design time? Let's calm down and first look at the problems sharding brings, that is, the added complexity we mentioned earlier.

4. Problems caused by sharding

4.1 Cross-database join queries

For example, querying contract information requires joining customer data. Since contract data and customer data live in different databases, we certainly cannot use a join directly.
The main solutions are:

  1. Field redundancy
    When querying the contract table in the contract database, we need to join the customer table in the customer database. We can copy the customer fields that are frequently joined directly into the contract table, thereby avoiding the cross-database join.

  2. Data synchronization
    For example, if the merchant system needs to query the product system's product table, we can simply create a product table in the merchant system's database and synchronize the product data regularly via ETL or other means.

  3. Global tables (broadcast tables)
    Data such as bank names and bank codes is used by many business systems. If we kept it only in the core system, every other system would have to join across databases to reach it. Instead, we can store the same copy of this basic data in every database.

  4. ER tables (binding tables)
    Some tables have a logical primary/foreign-key relationship. For example, the order table order_info stores the total quantity and amount of goods, while the order detail table order_detail stores the price and quantity of each product. This is a parent-child relationship, and the two tables are frequently joined. If the parent rows and child rows are stored in different databases, cross-database joins become troublesome. So can we place each parent row and all the child rows that belong to it on the same node (database)?

For example, the order with order_id=1001 and all its detail rows live on node1, while the order with order_id=1002 and all its details live on node2; queries then stay within a single database.

All of the above avoid cross-database joins through sensible data distribution. In our business code we also try hard not to write cross-database joins; if one appears, we should check whether the business or the data split is unreasonable. If a cross-database join is truly still needed, only the last method remains.

  5. Assemble data at the code layer
    Query the qualifying data from the different database nodes, then reassemble it in application code and return it to the client.
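The ER-table idea in point 4 can be sketched as: shard the child table by the parent's key, never by the child's own key, so parent and children always co-locate. A minimal illustration with 2 nodes:

```java
// Binding-table routing: order_info and order_detail share one routing key.
public class ErTableRouter {
    static final int NODE_COUNT = 2;

    public static int nodeForOrder(long orderId) {
        return (int) (orderId % NODE_COUNT);
    }

    public static int nodeForDetail(long orderId, long detailId) {
        // Deliberately ignore detailId: the routing key is the parent's id,
        // which is what keeps the parent-child join local to one node.
        return nodeForOrder(orderId);
    }
}
```
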

4.2 Distributed transactions

In a loan process, the contract system registers the data and the lending system must also generate a loan record. If the two actions do not succeed or fail together, data consistency problems arise. Within one database we can rely on MySQL's local transactions, but across databases that no longer works, so we need other solutions for transactions in a distributed environment.

4.2.1 The foundation of distributed systems: CAP theory
  1. C (Consistency): for a given client, a read operation returns the result of the latest write. For data distributed across nodes, if an update made at one node can be read back at every other node, we call it strong consistency; if some node cannot read it, that is distributed inconsistency.

  2. A (Availability): non-faulty nodes return a reasonable response (neither an error nor a timeout) within a reasonable time. The two keys are reasonable time, meaning a request must not block indefinitely and should return within a bounded period, and reasonable response, meaning the system clearly returns a result and the result is correct.

  3. P (Partition tolerance): the system continues to work when a network partition occurs. For example, if one machine in a cluster has a network problem, the cluster as a whole can still serve requests.

    The three CAP properties cannot all hold at once; at most two can be satisfied simultaneously. Building on AP, we get the BASE theory:

    Basically Available: when a distributed system fails, it may shed part of its functionality to keep the core functions available.

    Soft state: intermediate states are allowed in the system and do not affect overall availability. This corresponds to the inconsistency in CAP.

    Eventually consistent: after a period of time, the data on all nodes becomes consistent.

There are several common solutions for distributed transactions:

  1. Global transactions (XA two-phase commit)

  2. Distributed transactions based on a reliable message service

  3. Flexible transactions: TCC (Try-Confirm-Cancel), e.g. the tcc-transaction framework

  4. Best-effort notification: send messages to the other systems through message middleware (repeated delivery plus periodic reconciliation)
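The Try-Confirm-Cancel idea behind flexible transactions can be illustrated with a single in-memory account. This is a sketch of the protocol only; a real implementation would use a framework such as tcc-transaction, whose API this does not reproduce:

```java
// In-memory illustration of TCC on one account.
public class TccAccount {
    long balance;   // settled funds
    long frozen;    // funds reserved by an in-flight Try

    public TccAccount(long balance) { this.balance = balance; }

    // Try: check the business constraint and reserve the resource.
    public boolean tryDebit(long amount) {
        if (balance - frozen < amount) return false;
        frozen += amount;
        return true;
    }

    // Confirm: make the reservation permanent.
    public void confirmDebit(long amount) {
        frozen -= amount;
        balance -= amount;
    }

    // Cancel: release the reservation, restoring the pre-Try state.
    public void cancelDebit(long amount) {
        frozen -= amount;
    }
}
```

A coordinator calls tryDebit on every participant first; only if all succeed does it call confirmDebit everywhere, otherwise cancelDebit.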

4.3 Sorting, paging, and function calculation

Queries that span nodes and databases run into limit paging and order by sorting problems.
For example, with two nodes, node 1 stores the odd ids 1,3,5,7,9,... and node 2 the even ids 2,4,6,8,10,.... To execute select * from user_info order by id limit 0,10, we must fetch 10 rows from each node, then merge the data and re-sort it.

For functions such as max, min, sum, and count, we first execute the function on each shard, then aggregate and recompute over the per-shard result sets, and finally return the result.
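The merge step for the limit 0,10 example and for count can be sketched as: fetch the full page size from every shard, merge and re-sort on the client, then cut to the page size; counts are simply summed:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Client-side merging for cross-shard order by / limit and count.
public class ShardMerger {
    // Each shard already returned its own top n; merge, re-sort, cut.
    public static List<Integer> mergeTopN(List<Integer> shard1,
                                          List<Integer> shard2, int n) {
        List<Integer> all = new ArrayList<>(shard1);
        all.addAll(shard2);
        Collections.sort(all);
        return all.subList(0, Math.min(n, all.size()));
    }

    // count(*) across shards is the sum of the per-shard counts.
    public static long mergeCount(long... perShardCounts) {
        long total = 0;
        for (long c : perShardCounts) total += c;
        return total;
    }
}
```

Note that page k requires fetching the first k pages from every shard, which is why deep paging across shards is expensive.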

4.4 The global primary key (duplicate avoidance) problem

MySQL has auto-increment columns and Oracle has sequences. With a single database, unique IDs are guaranteed, but after a horizontal table split each table increments according to its own rules, and duplicate IDs are bound to appear, so local auto-increment can no longer be used.

Several common solutions:

  • UUID (Universally Unique Identifier)
    The standard form of a UUID contains 32 hexadecimal digits in 5 groups, 36 characters in the pattern 8-4-4-4-12, for example: c4e7956c-03e7-472c-8909-d733803e79a9.

    UUID as a primary key is the simplest solution: it is generated locally, performs well, and costs no network round trip. But the drawbacks are also obvious: a UUID is long, so it takes a lot of storage space, and it causes performance problems as an indexed primary key, because in InnoDB the disorder of UUIDs leads to frequent changes of data position and thus page splits.

  • Database sequence table
    Maintain the sequence in a database table that records the type, number of digits, starting value, and current value of each global key. When an application needs a global ID, it locks the row with select ... for update, increments the value, updates it, and returns it. Concurrency is relatively poor.

  • Redis
    Based on Redis's atomic increment of integers, IDs can be fetched in batches to reduce the pressure on the central store: each call obtains a whole range of ID numbers, and a new range is fetched only when the current one is used up.

  • Snowflake algorithm
    The core idea of Snowflake (64 bits):
    a) 41 bits for the millisecond timestamp, enough for about 69 years
    b) 10 bits for the machine ID (5 bits data center + 5 bits machine), supporting 1024 nodes
    c) 12 bits for the per-millisecond sequence number (each node can generate 4096 IDs per millisecond)
    d) plus a sign bit at the top, which is always 0

    Advantages: with the millisecond timestamp in the high bits, the generated IDs trend upward over time; it depends on no third-party system, so stability and efficiency are high, with a theoretical QPS of about 4,096,000/s (1000 × 2^12); there are no ID collisions across the whole distributed system; and the bit allocation can be adjusted flexibly for one's own business.

    The disadvantage is a strong dependence on the machine clock: if the clock is rolled back, duplicate IDs may be generated.
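A simplified, single-node Snowflake sketch following the bit layout above. Clock-rollback handling is deliberately omitted, and the epoch constant is an assumption (Twitter's original epoch):

```java
// Simplified Snowflake: 41-bit timestamp | 10-bit machine id | 12-bit sequence.
public class Snowflake {
    private static final long EPOCH = 1288834974657L; // assumed custom epoch
    private final long machineId;   // 0..1023
    private long lastMillis = -1L;
    private long sequence = 0L;

    public Snowflake(long machineId) {
        if (machineId < 0 || machineId > 1023)
            throw new IllegalArgumentException("machineId must fit in 10 bits");
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF;   // 4096 ids per millisecond
            if (sequence == 0) {                 // exhausted: spin to next ms
                while (now <= lastMillis) now = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return (now - EPOCH) << 22 | machineId << 12 | sequence;
    }
}
```

A production version must also detect a backwards-moving clock and refuse (or wait) instead of silently reusing timestamps.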

When we split the data and store it on different nodes, does that mean there are multiple data sources? If so, we must configure multiple data sources in our project. The question then becomes: when we execute a SQL statement such as an insert, on which data node should it run? And for a query whose data lives on only one node, how do we know which node, or must we query every node to get the result? So, on the path from client to server, at which layer can these problems be solved?

5. Solutions for multiple data sources / read-write routing

We must first analyze the path a SQL statement travels:
DAO → Mapper (ORM) → JDBC → proxy → database service

5.1 Client DAO layer

The first option is in our client code, for example the DAO layer: before connecting to a data source, we first determine which node to connect to according to the configured sharding rules, and then establish the connection.

Spring provides the abstract class AbstractRoutingDataSource, which enables dynamic switching of data sources.

Steps:

  1. Define multiple data sources in application.properties
  2. Create the @TargetDataSource annotation
  3. Create DynamicDataSource, inheriting AbstractRoutingDataSource
  4. Create the multi-data-source configuration class DynamicDataSourceConfig
  5. Create the aspect class DataSourceAspect, which intercepts classes annotated with @TargetDataSource and sets the data source
  6. Import the data source configuration on the startup class with @Import({DynamicDataSourceConfig.class})
  7. Annotate the implementation class, e.g. @TargetDataSource(name = DataSourceNames.SECOND), and call it

  • Advantages: no dependence on an ORM framework (replacing the ORM has no impact); simple to implement (no SQL parsing or routing rules to maintain); flexible to customize.
  • Disadvantages: cannot be reused across projects, and cannot be shared across languages.
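The heart of steps 2-5 is a ThreadLocal holding the current lookup key, which is what AbstractRoutingDataSource's determineCurrentLookupKey() consults. A framework-free sketch, with plain strings standing in for real DataSource objects so it stays self-contained:

```java
import java.util.Map;

// Sketch of the routing idea behind Spring's AbstractRoutingDataSource:
// an aspect sets a ThreadLocal key before the DAO call, and the router
// resolves the target by that key. Spring itself is not used here.
public class DynamicRouting {
    // The holder that the @TargetDataSource aspect would write to.
    static final ThreadLocal<String> CONTEXT =
        ThreadLocal.withInitial(() -> "first");

    public static void setDataSource(String name) { CONTEXT.set(name); }
    public static void clear() { CONTEXT.remove(); }

    // What determineCurrentLookupKey() plus the target map resolve to.
    public static String resolve(Map<String, String> targets) {
        return targets.get(CONTEXT.get());
    }
}
```

Because the key is per-thread, concurrent requests can route to different data sources without interfering with each other.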

5.2 ORM framework layer

The second option is the framework layer. For example, when using MyBatis to connect to the database, we can also choose the data source there, based on the MyBatis plugin interception mechanism (intercepting the query and update methods).

5.3 JDBC driver layer

Whether it is MyBatis, Hibernate, or Spring's JdbcTemplate, they all essentially wrap JDBC, so the third layer is the driver layer. Sharding-JDBC, for example, works by wrapping the JDBC objects.

The core objects of JDBC:
DataSource: data source
Connection: database connection
Statement: statement object
ResultSet: result set

We only need to wrap, intercept, or proxy these objects to implement sharding.
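As a toy illustration of what a JDBC-layer framework does before the SQL reaches the driver: rewrite the logical table name to the physical one chosen by the routing rule. A real framework such as Sharding-JDBC parses the SQL properly; the plain string replacement here is only for illustration:

```java
// Toy SQL rewriting: logical table name -> physical shard table.
public class SqlRewriter {
    static final int TABLE_COUNT = 2;

    public static String rewrite(String sql, String logicalTable, long shardKey) {
        String physical = logicalTable + "_" + (shardKey % TABLE_COUNT);
        // Naive textual replacement; real frameworks rewrite the parsed AST.
        return sql.replace(logicalTable, physical);
    }
}
```
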

5.4 Proxy layer

The first three options are all implemented on the client side, which means every project has to make the same changes and every programming language needs its own implementation. So can we extract the logic of choosing a data source and routing requests into a common service for all clients to use? That is the fourth layer, the proxy layer; Mycat and Sharding-Proxy, for example, belong here.

5.5 Database service

The last layer is the database service itself: some databases, or specific versions of them, implement this function natively. Details omitted here.


Origin blog.csdn.net/nonage_bread/article/details/111605610