Introduction and Best Practices of SequoiaDB Capacity Expansion

1. Introduction

  Unstructured data in business systems grows every year. Once a business system is in production, the cluster's available storage capacity gradually shrinks as business volume increases. Horizontal expansion of the whole cluster therefore needs to be planned before the business system is connected to the cluster, rather than after storage is exhausted. SequoiaDB is a distributed document database, so expanding the cluster can bring near-linear growth in cluster performance. Expansion has to solve two main problems: data storage capacity and the performance of the cluster as a whole. As data volume keeps growing and usage spreads after go-live, expansion is needed both to improve cluster performance and to add storage space.

2. Typical application scenarios

1. The volume of old data is large and you do not want to move it

  The business system's tables hold a very large volume of data. Business growth since go-live has exhausted the cluster's available storage, and there are no spare disk resources to support moving the old data. The goal is to resolve the capacity shortage without moving existing data, while leaving room for later business growth.

2. The volume of old data is small, the best performance is expected, and the data can be moved

  The business system's tables do not hold much data, but they serve daily business operations, especially a large number of front-end operations that demand fast query response, high concurrency, and high throughput. The goal is to keep read and write performance at its best as the data volume grows.

3. Fully non-stop service, transparent to the application

  The business system has hit a bottleneck in storage capacity or cluster performance, but it must serve external requests 24 hours a day. The goal is for the database expansion to be completely transparent to the upper-layer business.

4. A service downtime window is available, and the expansion is transparent to the application

  Similar to the three scenarios above, except that a time window exists in which the service can be stopped and the database changed. The goal is to complete the expansion without changing the business system.

5. The database is left unchanged and the application is updated

  The application code is relatively easy to change, so the cluster expansion is completed by modifying the code logic.

3. Comparison of advantages and disadvantages

The solutions for the scenarios above compare as follows:

 

| Criterion | Main/sub-table + export/import | Main/sub-table + split | Main/sub-table + domain | Ordinary partitioned table + split | Ordinary partitioned table + export/import |
| --- | --- | --- | --- | --- | --- |
| Existing data is not moved | No | Yes | Yes | No | No |
| Reads served during migration | Yes | Yes | Yes | Yes | No |
| Writes served during migration | No | Yes | Yes | No | No |
| 24-hour non-stop service | Yes | Yes | Yes | Yes | No |
| Transparent to the application (no application changes) | Yes | Yes | Yes | Yes | Yes |
| Manual splitting required | Yes | No | Yes | No | Yes |

Each solution is described in detail in Section 5.

4. SequoiaDB features that support expansion

4.1 Vertical partition

  SequoiaDB's vertical partitioning, also called the main/sub-table structure, stores a collection's data in different sub-tables according to specified partition-key ranges. It is similar to partitioned tables in traditional relational databases. The main collection itself does not store any data records.

[Figure: SequoiaDB vertical partitioning (labels: detach original sub-table, attach)]
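As a rough illustration of how a main table routes records, the following Python sketch maps a partition-key value to a sub-table by range. The sub-table names and the half-open [low, up) bounds are hypothetical, and the code only models the idea; it does not call SequoiaDB.

```python
from bisect import bisect_right

# Hypothetical sub-tables, each covering a half-open key range [low, up).
# Entries are sorted by lower bound and must not overlap.
SUB_TABLES = [
    (201801, 201807, "cs.orders_2018h1"),
    (201807, 201901, "cs.orders_2018h2"),
    (201901, 201907, "cs.orders_2019h1"),
]

def route(key):
    """Return the sub-table whose range contains key, or None if no range matches."""
    lows = [low for low, _, _ in SUB_TABLES]
    i = bisect_right(lows, key) - 1   # last range whose lower bound is <= key
    if i < 0:
        return None
    low, up, name = SUB_TABLES[i]
    return name if low <= key < up else None
```

Because the main table only holds this routing information, attaching or detaching a sub-table changes where a key range is served without touching records in the other sub-tables.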

4.2 Split method

  The split method of a collection is used for data partitioning. A large collection stored on a few physical blocks (in this article, data groups) is divided into several smaller parts according to the value of one or more fields, and some of those parts are then distributed to additional physical blocks.
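The idea of splitting by a field value can be pictured with a small in-memory model. This is a sketch only: `split_by_key`, the group names, and the `id` field are hypothetical, and real splits move data between data groups inside the cluster.

```python
def split_by_key(groups, source, target, field, low):
    """Move every record in `source` whose `field` value is >= low to `target`.

    `groups` maps a group name to its list of records. Returns the number moved.
    """
    stay, move = [], []
    for rec in groups[source]:
        (move if rec[field] >= low else stay).append(rec)
    groups[source] = stay
    groups.setdefault(target, []).extend(move)
    return len(move)

# Ten records start on one group; records with id >= 6 move to a second group.
groups = {"group1": [{"id": i} for i in range(10)]}
moved = split_by_key(groups, "group1", "group2", "id", 6)
```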

4.3 Data Migration

  SequoiaDB provides data import (sdbimprt) and export (sdbexprt) tools, which import data in JSON or CSV format into a SequoiaDB database and export data from SequoiaDB collections to external files.

5. Application Cases

5.1 Test Environment

Cluster environment of V2.8.2:

| ReplicaGroup | Machine name | Port | Data storage path | Master node |
| --- | --- | --- | --- | --- |
| SYSCoord | Host1 | 11810 | /opt/sequoiadb/database/coord/11810/ | Y |
| SYSCoord | Host2 | 11810 | /opt/sequoiadb/database/coord/11810/ | Y |
| SYSCoord | Host3 | 11810 | /opt/sequoiadb/database/coord/11810/ | Y |
| SYSCatalogGroup | Host1 | 13000 | /opt/sequoiadb/database/cata/13000/ | N |
| SYSCatalogGroup | Host2 | 13000 | /opt/sequoiadb/database/cata/13000/ | Y |
| SYSCatalogGroup | Host3 | 13000 | /opt/sequoiadb/database/cata/13000/ | N |
| datagroup1 | Host1 | 11910 | /mnt/disk1/sequoiadb/data/11910 | Y |
| datagroup1 | Host2 | 11910 | /mnt/disk1/sequoiadb/data/11910 | N |
| datagroup1 | Host3 | 11910 | /mnt/disk1/sequoiadb/data/11910 | N |
| datagroup2 | Host1 | 11920 | /mnt/disk2/sequoiadb/data/11920 | N |
| datagroup2 | Host2 | 11920 | /mnt/disk2/sequoiadb/data/11920 | N |
| datagroup2 | Host3 | 11920 | /mnt/disk2/sequoiadb/data/11920 | Y |
| datagroup3 | Host1 | 11930 | /mnt/disk3/sequoiadb/data/11930 | N |
| datagroup3 | Host2 | 11930 | /mnt/disk3/sequoiadb/data/11930 | N |
| datagroup3 | Host3 | 11930 | /mnt/disk3/sequoiadb/data/11930 | Y |
| datagroup4 | Host1 | 11940 | /mnt/disk4/sequoiadb/data/11940 | N |
| datagroup4 | Host2 | 11940 | /mnt/disk4/sequoiadb/data/11940 | N |
| datagroup4 | Host3 | 11940 | /mnt/disk4/sequoiadb/data/11940 | Y |
| datagroup5 | Host1 | 11950 | /mnt/disk5/sequoiadb/data/11950 | Y |
| datagroup5 | Host2 | 11950 | /mnt/disk5/sequoiadb/data/11950 | N |
| datagroup5 | Host3 | 11950 | /mnt/disk5/sequoiadb/data/11950 | N |
| datagroup6 | Host1 | 11960 | /mnt/disk6/sequoiadb/data/11960 | N |
| datagroup6 | Host2 | 11960 | /mnt/disk6/sequoiadb/data/11960 | N |
| datagroup6 | Host3 | 11960 | /mnt/disk6/sequoiadb/data/11960 | Y |

The following sections combine the features above into several expansion methods for different scenarios:

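5.2 Method 1: main/sub-table + export/import

  Before the expansion, the data of each sub-table of the main table (collections 1-n) is stored evenly on data groups 1-3 in the original domain, as shown in Figure 1. The goal is to distribute the data of each sub-table (collections 1-n) evenly over data groups 1-6, which comprise both the original and the newly added data groups, as shown in Figure 4. The expansion proceeds as follows:

(1). Add the newly added replica groups to the domain;

(2). In the updated domain, create a new sub-table for each sub-table of the main table;

(3). Create a pipe file;

(4). Export the original sub-table's data to the pipe file while importing the data from the pipe file into the new sub-table;

(5). After the export and import complete, verify the data, detach the original sub-table from the main table, and attach the new sub-table;

(6). Drop the original sub-table.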
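Steps (3) and (4) rely on a named pipe so that exported data flows straight into the import without an intermediate file on disk. The Python sketch below shows that mechanism in isolation on a POSIX system. `export_rows` and `import_rows` are hypothetical stand-ins for the export and import tools, and the pipe name is made up.

```python
import os
import tempfile
import threading

def export_rows(rows, path):
    """Stand-in for the export tool: write one record per line into the pipe."""
    with open(path, "w") as pipe:      # blocks until a reader opens the pipe
        for row in rows:
            pipe.write(row + "\n")

def import_rows(path):
    """Stand-in for the import tool: read records from the pipe as they arrive."""
    with open(path) as pipe:           # blocks until a writer opens the pipe
        return [line.rstrip("\n") for line in pipe]

def migrate(rows):
    """Stream rows from the exporter to the importer through a named pipe."""
    workdir = tempfile.mkdtemp()
    fifo = os.path.join(workdir, "subcl.pipe")
    os.mkfifo(fifo)                    # step (3): create the pipe file
    try:
        writer = threading.Thread(target=export_rows, args=(rows, fifo))
        writer.start()                 # step (4): export and import run concurrently
        imported = import_rows(fifo)
        writer.join()
        return imported
    finally:
        os.remove(fifo)
        os.rmdir(workdir)
```

Comparing the number of records read with the number written is a simple form of the check in step (5) before the sub-tables are swapped.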

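  This method applies when the original data groups are not close to full: the data group closest to full must still have room for the data volume of the largest collection divided by the number of data groups after the expansion. Because the main table's name is unaffected by the expansion, and detaching the original sub-table and attaching the new one is fast, queries can be served without downtime and the upper-layer business system keeps running without changes. However, the data in the collection must not change during the migration.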
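The capacity condition above is simple arithmetic and can be checked before starting. A minimal sketch, assuming free space and collection size are measured in the same unit; the function name and all figures are hypothetical.

```python
def can_use_method_1(free_gb_per_group, largest_collection_gb, groups_after):
    """Check whether the fullest original group can hold its share of the largest collection.

    After the expansion each group holds roughly 1/groups_after of a collection,
    so the group with the least free space must have at least that much room.
    """
    share_gb = largest_collection_gb / groups_after
    return min(free_gb_per_group.values()) >= share_gb

# Hypothetical figures: three original groups, six groups after the expansion.
free = {"datagroup1": 120, "datagroup2": 90, "datagroup3": 150}
fits_600 = can_use_method_1(free, 600, 6)   # each group would need 100 GB
fits_480 = can_use_method_1(free, 480, 6)   # each group would need 80 GB
```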

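5.3 Method 2: main/sub-table + split

  Before the expansion, the data of each sub-table of the main table (collections 1 and 2) is stored evenly on data groups 1-3 in the original domain, as shown in Figure 1. The goal is to add new sub-tables (new collections 1 and 2) whose data is distributed evenly over the newly added data groups, as shown in Figure 4.

  When creating the sub-table, set the autoSplit attribute to false and use the Group attribute. However, Group can name only one data group for the collection, not several. The data ranges therefore have to be split manually onto the other new data groups with the split method on the sub-table. The expansion proceeds as follows:

1. Add the newly added replica groups to the domain;

2. Create the new sub-table on one of the newly added replica groups;

3. Use split to manually split the new sub-table onto the other newly added data groups;

4. Attach the new sub-table to the main table.

  This method suits main/sub-tables whose partitioning is time-based, where new data from the business system is stored in new sub-tables. When year or month is used as the main table's partition key and the current period's sub-table is nearly full, a sub-table for the next period is added to hold that period's data. The method is transparent to the business system and does not affect its operation. However, because Group can name only one data group, the manual splitting is tedious.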
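Part of the tedium of manual splitting is working out how much to move in each step. The sketch below computes the percentages for spreading a sub-table that starts on one data group evenly across several. It assumes each split is given a percentage of the data currently held by the source group; that assumption and the function name are illustrative, not taken from the source.

```python
def even_split_percents(n_groups):
    """Percent of the source group's remaining data to move in each successive split
    so that all n_groups end up with an equal share."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    # Before split k (0-based) the source still holds (n_groups - k) equal shares
    # and gives exactly one of them to the next target group.
    return [100.0 / (n_groups - k) for k in range(n_groups - 1)]

# Three new data groups: two splits from the group the sub-table was created on.
percents = even_split_percents(3)
```

For three groups the first split moves a third of the data and the second moves half of what remains, leaving a third on each group.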

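5.4 Method 3: main/sub-table + domain

  Because a domain is specified at the collection-space level, the new sub-table cannot be created in the original collection space. A new collection space must be created with the new domain specified.

1. Create a new domain on the newly added replica groups;

2. Create a new collection space in the new domain;

3. Create a new sub-table in the new collection space;

4. Attach the new sub-table to the main table.

  This is much like Method 2 and suits the same scenarios, but it removes the manual splitting. A new domain and collection space are added instead, and the domain's autoSplit feature completes the split automatically. The drawback is that the extra domain and collection space add complexity.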

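5.5 Method 4: ordinary partitioned table + split

1. Add the newly added replica groups to the domain;

2. Use the split method to manually split the data of each collection in the domain onto the newly added data groups.

  This is the most conventional method. It suits ordinary collections that do not use the main/sub-table structure, and queries can be served without downtime.

Note: during the split, the collection's data volume as seen from the coordinator node fluctuates, and inserts, updates, and deletes on the collection may run into problems. It is advisable to run the split when no modifications are in progress.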
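Planning the manual splits for every collection can be reduced to a small calculation. The sketch below produces one possible plan for moving data from the original data groups to the new ones so that every group ends with an equal share. It assumes the data is currently spread evenly over the original groups and that each split takes a percentage of the data currently held by the source group. The function name is hypothetical, and the group names come from the test environment in Section 5.1.

```python
from fractions import Fraction

def rebalance_plan(old_groups, new_groups):
    """Return a list of (source, target, percent_of_source_current_data) splits."""
    total = len(old_groups) + len(new_groups)
    target = Fraction(1, total)                         # final share per group
    held = {g: Fraction(1, len(old_groups)) for g in old_groups}
    need = {g: target for g in new_groups}
    plan = []
    for src in old_groups:
        for dst in new_groups:
            surplus = held[src] - target
            if surplus <= 0:
                break                                   # source is already at its target
            amount = min(surplus, need[dst])
            if amount == 0:
                continue                                # this new group is already full
            percent = float(amount / held[src] * 100)
            plan.append((src, dst, round(percent, 2)))
            held[src] -= amount
            need[dst] -= amount
    return plan

plan = rebalance_plan(
    ["datagroup1", "datagroup2", "datagroup3"],
    ["datagroup4", "datagroup5", "datagroup6"],
)
```

With three original and three new groups, each original group hands half of its data to one new group. Actual splits should still follow the note above about avoiding concurrent modifications.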

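5.6 Method 5: ordinary partitioned table + export/import

1. Add the newly added replica groups to the domain;

2. Export all collection data in the domain, together with the collection space, collection, and index definitions;

3. Drop the original collection spaces in the domain;

4. Create the collection spaces, collections, and indexes again;

5. Import the previously exported data into the newly created collections.

  This amounts to rebuilding the entire structure within the domain. It suits cases with a great many collection spaces, collections, and data groups but little data in the domain, or cases where all the data can be cleared. It is especially suitable when early table-creation testing shows that the table structure or capacity does not meet requirements and a rebuild is needed.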

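6. Summary

  Choose the expansion method according to the scenario. If the main/sub-table storage structure was used before the expansion, choose according to the characteristics of the main table's partition key. If the partition key is time-based (logically, data loaded during a given period only lands in certain sub-tables), for example when the business stores only recent data for now and later data will be stored on machines deployed in the future, Methods 2 and 3 can be used, and Method 3 involves less work when many data groups are added. If the partition key is not time-based, Method 1 can be used. If the main/sub-table structure was not used before the expansion, Method 4 can be used. If, early in storage planning (while the data volume is still small), the originally planned storage turns out to be insufficient after the tables are created and data groups must be added, Method 5 can be used to rebuild the table structure quickly.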
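The selection rules in this summary can be written down as a small decision function. This is only a restatement of the text as a sketch: the function and parameter names are hypothetical, and what counts as "many" new data groups is left to the caller.

```python
def choose_methods(main_sub_table, time_based_key=False,
                   many_new_groups=False, early_rebuild=False):
    """Return the candidate method numbers (1-5) for a given situation."""
    if early_rebuild:
        return [5]          # early planning stage, little data: rebuild the tables
    if not main_sub_table:
        return [4]          # ordinary partitioned table: split
    if not time_based_key:
        return [1]          # main/sub-table without a time-based key: export/import
    # Time-based partition key: Method 3 is less work when many groups are added.
    return [3] if many_new_groups else [2, 3]
```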

 
