What? That much data and still no sharding? An introduction to Sharding-JDBC with hands-on practice


  • Key concepts
  • Data sharding
  • Read/write splitting
  • Execution flow
  • Hands-on project

Recently, several tables in our project have grown large enough to cause database performance problems, so I wanted to introduce middleware that shards data across databases and tables automatically. After some research, Sharding-JDBC turned out to be the most mature and widely used client-side sharding component in the Java ecosystem.

This article introduces the core concepts of Sharding-JDBC along with a production-environment walkthrough, aiming to help team members understand Sharding-JDBC quickly and start using it right away.

Key concepts

Before using Sharding-JDBC, you first need to understand the following core concepts.

Logical table

The umbrella name for a group of horizontally split tables that share the same logic and data structure. Example: order data is split into 10 tables by the last digit of the primary key, t_order_0 through t_order_9; their logical table name is t_order.

Actual table

A physical table that actually exists in a sharded database, i.e. t_order_0 through t_order_9 in the previous example.

Data node

The smallest unit of data sharding. It consists of a data source name and a table name, for example: ds_0.t_order_0.

Binding table

Refers to a primary table and its child tables that share consistent sharding rules. Example: if the t_order table and the t_order_item table are both sharded by order_id, the two tables have a binding relationship. Multi-table joins between binding tables never produce a Cartesian product of routes, so join queries are much more efficient.

For example, if the SQL is:

SELECT i.* FROM t_order o JOIN t_order_item i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

Assuming t_order and t_order_item each map to two actual tables, the actual tables are t_order_0, t_order_1, t_order_item_0 and t_order_item_1.

When the binding relationship is not configured, and the shard key order_id routes the value 10 to shard 0 and the value 11 to shard 1, the routed SQL becomes 4 statements that form a Cartesian product:

SELECT i.* FROM t_order_0 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_0 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_1 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_1 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);

After the binding relationship is configured, the routed SQL drops to 2 statements:

SELECT i.* FROM t_order_0 o JOIN t_order_item_0 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
SELECT i.* FROM t_order_1 o JOIN t_order_item_1 i ON o.order_id=i.order_id WHERE o.order_id in (10, 11);
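
For reference, here is a minimal sketch of declaring that binding relationship through the Sharding-JDBC 4.x Java API; the data source name ds0 and the two-shards-per-table split are assumptions carried over from the example above:

ShardingRuleConfiguration shardingRuleConfig = new ShardingRuleConfiguration();
// Register both logical tables with their real data nodes
shardingRuleConfig.getTableRuleConfigs().add(new TableRuleConfiguration("t_order", "ds0.t_order_$->{0..1}"));
shardingRuleConfig.getTableRuleConfigs().add(new TableRuleConfiguration("t_order_item", "ds0.t_order_item_$->{0..1}"));
// Declare the binding group so joins on order_id avoid Cartesian-product routing
shardingRuleConfig.getBindingTableGroups().add("t_order,t_order_item");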

Broadcast table

Refers to tables that exist in every sharded data source, with identical structure and identical data in each database. Suitable for small tables that need to be joined with huge sharded tables, such as dictionary tables.
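
Configuration-wise, a broadcast table only needs to be declared by name; a sketch extending the shardingRuleConfig above, where t_dict is a hypothetical dictionary table:

// t_dict is created in every data source; writes are replicated to all of them
shardingRuleConfig.getBroadcastTables().add("t_dict");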

Data sharding

Shard key

The database column used for sharding; it is the key column when splitting databases (tables) horizontally. Example: if the order table is sharded by taking the order primary key modulo some value, the order primary key is the shard column. If a SQL statement contains no shard column, full routing is performed and performance suffers. Besides single shard columns, Sharding-JDBC also supports sharding by multiple columns.

Sharding algorithm

Data is sharded according to a sharding algorithm; sharding by =, >=, <=, >, <, BETWEEN and IN is supported. Sharding algorithms are implemented by the application developer, so the achievable flexibility is very high.

Four types of sharding algorithm are currently provided. Because sharding is tightly coupled to the business, Sharding-JDBC does not ship concrete built-in sharding algorithms; instead it abstracts the common scenarios into sharding strategies and exposes interfaces so that application developers can implement the algorithms themselves.

Precise sharding algorithm

Corresponds to PreciseShardingAlgorithm, used for sharding with = and IN when a single column is the shard key. Must be used together with StandardShardingStrategy.

Range sharding algorithm

Corresponds to RangeShardingAlgorithm, used for sharding with BETWEEN AND, >, <, >=, <= when a single column is the shard key. Must be used together with StandardShardingStrategy.

Complex sharding algorithm

Corresponds to ComplexKeysShardingAlgorithm, used for sharding scenarios where multiple columns form the shard key. The logic across multiple shard columns can be intricate, and the application developer has to handle that complexity. Must be used together with ComplexShardingStrategy.

Hint sharding algorithm

Corresponds to HintShardingAlgorithm, used when the shard value is supplied via a hint rather than extracted from the SQL. Must be used together with HintShardingStrategy.

Sharding strategy

A sharding strategy consists of a shard key and a sharding algorithm; the algorithm is split out only because of its independence. What actually performs sharding is the combination shard key + sharding algorithm, i.e. the sharding strategy. There are currently 5 sharding strategies; the 4 commonly used ones are described below (the fifth, the none strategy, simply does not shard).

Standard sharding strategy

Corresponds to StandardShardingStrategy. Provides sharding support for =, >, <, >=, <=, IN and BETWEEN AND in SQL statements. StandardShardingStrategy supports only a single shard column and provides two sharding algorithm hooks: PreciseShardingAlgorithm and RangeShardingAlgorithm.

PreciseShardingAlgorithm is mandatory and handles = and IN sharding. RangeShardingAlgorithm is optional and handles BETWEEN AND, >, <, >=, <= sharding. If no RangeShardingAlgorithm is configured, BETWEEN AND conditions in SQL are routed to all databases.
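
Assembled through the Java API, a standard strategy for the earlier t_order example might look like the following sketch; MyPreciseShardingAlgorithm and MyRangeShardingAlgorithm are placeholders for your own implementations, and ds0 is an assumed data source name:

TableRuleConfiguration orderRule = new TableRuleConfiguration("t_order", "ds0.t_order_$->{0..1}");
// The precise algorithm is mandatory (= and IN); the range algorithm is optional (BETWEEN, <, >, ...)
orderRule.setTableShardingStrategyConfig(
        new StandardShardingStrategyConfiguration("order_id", new MyPreciseShardingAlgorithm(), new MyRangeShardingAlgorithm()));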

Complex sharding strategy

Corresponds to ComplexShardingStrategy. Provides sharding support for =, >, <, >=, <=, IN and BETWEEN AND in SQL statements. ComplexShardingStrategy supports multiple shard columns. Because the relationships between multiple shard columns are complex, it does not add extra encapsulation: the shard column-value combinations and the sharding operator are passed straight through to the sharding algorithm, which the application developer implements entirely, giving maximum flexibility.

Row expression sharding strategy

Corresponds to InlineShardingStrategy. Uses Groovy expressions to provide sharding support for = and IN in SQL statements, and supports only a single shard column. For simple sharding algorithms it saves tedious Java coding; for example, t_user_$->{u_id % 8} means the t_user table is split into 8 tables by u_id modulo 8, named t_user_0 through t_user_7. It can be seen as a lightweight form of the precise sharding algorithm.
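
The t_user example maps to a one-line Java configuration; a sketch assuming a data source named ds0:

TableRuleConfiguration userRule = new TableRuleConfiguration("t_user", "ds0.t_user_$->{0..7}");
// The Groovy expression computes the real table suffix from u_id
userRule.setTableShardingStrategyConfig(new InlineShardingStrategyConfiguration("u_id", "t_user_$->{u_id % 8}"));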

Hint sharding strategy

Corresponds to HintShardingStrategy: the shard value is specified through a hint instead of being extracted from the SQL.
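
With a hint strategy, shard values are supplied through HintManager, which stores them in the current thread. A minimal usage sketch; the t_order table and the values are illustrative:

// HintManager is thread-bound; use try-with-resources so the hint
// does not leak into later queries on the same thread
try (HintManager hintManager = HintManager.getInstance()) {
    hintManager.addDatabaseShardingValue("t_order", 1); // database shard value for t_order
    hintManager.addTableShardingValue("t_order", 2);    // table shard value for t_order
    // execute the SQL here; routing uses the hint values instead of values parsed from SQL
}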

Distributed primary key

Used to generate globally unique ids in a distributed environment. Sharding-JDBC ships built-in distributed primary key generators such as UUID and SNOWFLAKE, and it also exposes a generator interface so users can implement custom primary key generators. To preserve database performance, primary keys should still be trend-increasing, which avoids frequent data page splits.
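
Through the Java API, attaching the built-in SNOWFLAKE generator to a table rule takes one line; a sketch reusing the orderRule from the standard-strategy example above:

// Generate trend-increasing distributed ids for the order_id column
orderRule.setKeyGeneratorConfig(new KeyGeneratorConfiguration("SNOWFLAKE", "order_id"));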

Read/write splitting

Sharding-JDBC provides one-master-multiple-slaves read/write splitting, which can be used on its own or together with database and table sharding.

  • Within the same thread and the same database connection, once a write occurs, subsequent reads go to the master library to guarantee data consistency.
  • Hint-based forced master routing (see the sketch after this list).
  • In the master-slave model, transactions read and write against the master library only.
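
The hint-based master routing in the second bullet is a one-liner; a minimal sketch:

// Force all reads in the current thread to the master library
try (HintManager hintManager = HintManager.getInstance()) {
    hintManager.setMasterRouteOnly();
    // run the reads that must observe the latest writes
}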

Execution flow

The principle behind Sharding-JDBC can be summed up very simply: the core is the pipeline SQL parsing => executor optimization => SQL routing => SQL rewriting => SQL execution => result merging.

[Figure: Sharding-JDBC execution flow]

Hands-on project

Spring Boot integration

Add the Maven dependency

<dependency>
    <groupId>org.apache.shardingsphere</groupId>
    <artifactId>sharding-jdbc-spring-boot-starter</artifactId>
    <version>4.0.1</version>
</dependency>

Data source configuration

If you use sharding-jdbc-spring-boot-starter and configure both the data sources and the sharding rules under the spring.shardingsphere prefix, the corresponding data sources are created and injected into the Spring container automatically.

spring.shardingsphere.datasource.names=ds0,ds1

spring.shardingsphere.datasource.ds0.type=org.apache.commons.dbcp.BasicDataSource
spring.shardingsphere.datasource.ds0.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds0.url=jdbc:mysql://localhost:3306/ds0
spring.shardingsphere.datasource.ds0.username=root
spring.shardingsphere.datasource.ds0.password=

spring.shardingsphere.datasource.ds1.type=org.apache.commons.dbcp.BasicDataSource
spring.shardingsphere.datasource.ds1.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.ds1.url=jdbc:mysql://localhost:3306/ds1
spring.shardingsphere.datasource.ds1.username=root
spring.shardingsphere.datasource.ds1.password=

# other sharding configuration

But in our existing project the data sources are configured separately, so we have to disable the auto-configuration in sharding-jdbc-spring-boot-starter and, following its source code, wire the data source up ourselves.

Add @SpringBootApplication(exclude = {org.apache.shardingsphere.shardingjdbc.spring.boot.SpringBootConfiguration.class}) to the startup class to exclude it, then write a custom configuration class that assembles the DataSource.

@Configuration
@Slf4j
@EnableConfigurationProperties({
        SpringBootShardingRuleConfigurationProperties.class,
        SpringBootMasterSlaveRuleConfigurationProperties.class,
        SpringBootEncryptRuleConfigurationProperties.class,
        SpringBootPropertiesConfigurationProperties.class})
@AutoConfigureBefore(DataSourceConfiguration.class)
public class DataSourceConfig implements ApplicationContextAware {

    @Autowired
    private SpringBootShardingRuleConfigurationProperties shardingRule;

    @Autowired
    private SpringBootPropertiesConfigurationProperties props;

    private ApplicationContext applicationContext;

    @Bean("shardingDataSource")
    @Conditional(ShardingRuleCondition.class)
    public DataSource shardingDataSource() throws SQLException {
        // Collect the data sources that were configured elsewhere (Druid in this project)
        Map<String, DruidDataSourceWrapper> beans = applicationContext.getBeansOfType(DruidDataSourceWrapper.class);
        Map<String, DataSource> dataSourceMap = new HashMap<>(4);
        beans.forEach(dataSourceMap::put);
        // Build the shardingDataSource from the collected data sources and the sharding rules
        return ShardingDataSourceFactory.createDataSource(dataSourceMap, new ShardingRuleConfigurationYamlSwapper().swap(shardingRule), props.getProps());
    }

    @Bean
    public SqlSessionFactory sqlSessionFactory() throws Exception {
        SqlSessionFactoryBean sqlSessionFactoryBean = new SqlSessionFactoryBean();
        // Hand the shardingDataSource to MyBatis' SqlSessionFactory
        sqlSessionFactoryBean.setDataSource(shardingDataSource());
        // other settings ...
        return sqlSessionFactoryBean.getObject();
    }

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) throws BeansException {
        // Required by ApplicationContextAware; keeps the context used in shardingDataSource()
        this.applicationContext = applicationContext;
    }
}

Distributed id generator configuration

Sharding-JDBC ships UUID and SNOWFLAKE generators and also lets users implement custom id generators. For example, you can implement a generator of type SEQ that calls a unified distributed id service to obtain ids.

@Data
public class SeqShardingKeyGenerator implements ShardingKeyGenerator {

    private Properties properties = new Properties();

    @Override
    public String getType() {
        return "SEQ";
    }

    @Override
    public synchronized Comparable<?> generateKey() {
       // fetch an id from the unified distributed id service
    }
}

Since ShardingKeyGenerator extensions are loaded through the JDK ServiceLoader SPI mechanism, you also need to create the file org.apache.shardingsphere.spi.keygen.ShardingKeyGenerator under the resources/META-INF/services directory.

Its content is the fully qualified class name of SeqShardingKeyGenerator. From then on, using the generator only requires setting the distributed primary key generator type to SEQ.
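
Concretely, assuming SeqShardingKeyGenerator lives in a package such as com.xx (the package is illustrative), the SPI file would look like this:

# resources/META-INF/services/org.apache.shardingsphere.spi.keygen.ShardingKeyGenerator
com.xx.SeqShardingKeyGenerator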

At this point Sharding-JDBC is integrated into the Spring Boot project, and the data sharding itself can be configured next.

Data sharding in action

If a table's eventual data volume can be estimated at the start of a project, it can of course be sharded according to that estimate from day one. But in most cases nobody predicts the order of magnitude up front, and the usual sequence of events is:

  • Query performance on a production table starts to degrade, and investigation shows the cause is excessive data volume.
  • Estimate the future data volume from historical growth, and choose a database/table splitting strategy for the specific business scenario.
  • Implement the automatic sharding in code.

Let's walk through a concrete data sharding example. Suppose a table has the following structure:

CREATE TABLE `hc_question_reply_record` (
  `id` bigint NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id',
  `reply_text` varchar(500) NOT NULL DEFAULT '' COMMENT 'reply content',
  `reply_wheel_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'reply time',

  `ctime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  `mtime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
  PRIMARY KEY (`id`),
  INDEX `idx_reply_wheel_time` (`reply_wheel_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
  COMMENT='reply detail records';

Choosing the sharding scheme

First, check the target table's monthly growth trend:

SELECT count(*), date_format(ctime, '%Y-%m') AS `month`
FROM hc_question_reply_record
GROUP BY date_format(ctime, '%Y-%m');

[Figure: monthly new-row counts for hc_question_reply_record]

The table currently grows by about 1.8 million rows per month, and growth is expected to reach about 3 million per month later (roughly doubling). To keep a single table under 10 million rows, reply_wheel_time can be used as the shard key and the data archived by quarter: at 3 million rows per month, a quarter holds about 9 million rows, just under the ceiling.

Sharding configuration

spring:
  # sharding-jdbc configuration
  shardingsphere:
    # data source names
    datasource:
      names: defaultDataSource,slaveDataSource
    sharding:
      # master-slave configuration
      master-slave-rules:
        defaultDataSource:
          # master data source
          master-data-source-name: defaultDataSource
          # slave data sources
          slave-data-source-names: slaveDataSource
      tables:
        # sharding configuration for hc_question_reply_record
        hc_question_reply_record:
          # real data nodes, e.g. hc_question_reply_record_2020_q1
          actual-data-nodes: defaultDataSource.hc_question_reply_record_$->{2020..2025}_q$->{1..4}
          # table sharding strategy
          table-strategy:
            standard:
              # shard column
              sharding-column: reply_wheel_time
              # precise sharding algorithm (fully qualified class name)
              precise-algorithm-class-name: com.xx.QuestionRecordPreciseShardingAlgorithm
              # range sharding algorithm for BETWEEN conditions, optional; the class must
              # implement RangeShardingAlgorithm and provide a no-arg constructor
              range-algorithm-class-name: com.xx.QuestionRecordRangeShardingAlgorithm

      # default distributed id generator
      default-key-generator:
        type: SEQ
        column: id

Sharding algorithm implementation

Precise sharding algorithm: QuestionRecordPreciseShardingAlgorithm

public class QuestionRecordPreciseShardingAlgorithm implements PreciseShardingAlgorithm<Date> {
  /**
   * Sharding.
   *
   * @param availableTargetNames available data source or table names
   * @param shardingValue        sharding value
   * @return the data source or table name routed to
   */
  @Override
  public String doSharding(Collection<String> availableTargetNames, PreciseShardingValue<Date> shardingValue) {
      return ShardingUtils.quarterPreciseSharding(availableTargetNames, shardingValue);
  }
}

Range sharding algorithm: QuestionRecordRangeShardingAlgorithm

public class QuestionRecordRangeShardingAlgorithm implements RangeShardingAlgorithm<Date> {

  /**
   * Sharding.
   *
   * @param availableTargetNames available data source or table names
   * @param shardingValue        sharding value
   * @return the data source or table names routed to
   */
  @Override
  public Collection<String> doSharding(Collection<String> availableTargetNames, RangeShardingValue<Date> shardingValue) {
      return ShardingUtils.quarterRangeSharding(availableTargetNames, shardingValue);
  }
}

Concrete sharding logic: ShardingUtils

@UtilityClass
public class ShardingUtils {

    public static final String QUARTER_SHARDING_PATTERN = "%s_%d_q%d";

    /**
     * logicTableName_{year}_q{quarter}
     * Range sharding by quarter: select the real tables matching a range condition.
     * One possible implementation, assuming the query range is bounded on both sides.
     * @param availableTargetNames the available real tables
     * @param shardingValue the sharding value
     * @return the matching real table names
     */
    public Collection<String> quarterRangeSharding(Collection<String> availableTargetNames, RangeShardingValue<Date> shardingValue) {
        Range<Date> range = shardingValue.getValueRange();
        int from = quarterIndex(range.lowerEndpoint());
        int to = quarterIndex(range.upperEndpoint());
        Collection<String> result = new LinkedHashSet<>();
        // Walk quarter by quarter and keep only the real tables that actually exist
        for (int i = from; i <= to; i++) {
            String table = String.format(QUARTER_SHARDING_PATTERN, shardingValue.getLogicTableName(), i / 4, i % 4 + 1);
            if (availableTargetNames.contains(table)) {
                result.add(table);
            }
        }
        return result;
    }

    /**
     * logicTableName_{year}_q{quarter}
     * Precise sharding by quarter: compute the single real table matching an equality condition.
     * @param availableTargetNames the available real tables
     * @param shardingValue the sharding value
     * @return the matching real table name
     */
    public String quarterPreciseSharding(Collection<String> availableTargetNames, PreciseShardingValue<Date> shardingValue) {
        int index = quarterIndex(shardingValue.getValue());
        String table = String.format(QUARTER_SHARDING_PATTERN, shardingValue.getLogicTableName(), index / 4, index % 4 + 1);
        if (!availableTargetNames.contains(table)) {
            throw new UnsupportedOperationException("no real table matches " + table);
        }
        return table;
    }

    // Map a date to an absolute quarter index: year * 4 + zero-based quarter
    private int quarterIndex(Date date) {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar.get(Calendar.YEAR) * 4 + calendar.get(Calendar.MONTH) / 3;
    }
}

This completes the quarterly sharding of the hc_question_reply_record table with reply_wheel_time as the shard key. One more thing to note: after sharding, queries should include the shard key in their conditions whenever possible; otherwise the query is routed to every database and performance is very poor.

Also note that Sharding-JDBC's support for MySQL full-text indexes is limited, so watch out for that in your project. All in all the whole process is fairly simple, and if you run into other business scenarios later, I believe you can work them out along the same lines.


Originally published at blog.csdn.net/weixin_47067712/article/details/108334641