[SequoiaDB] Sequoia Tech | Distributed database: optimization practice for oversized tables at the hundred-billion-record scale

01

Introduction

As subscribers grow and the business develops, the data volume of enterprise business systems keeps increasing, and the performance of large tables becomes a major obstacle to delivering business functions. Among these, the transaction-record table (a table that logs every business transaction, sometimes called a journal or flow table) is the most common type of large table and a performance bottleneck that enterprise users frequently run into.

This article focuses on large transaction-record tables and explores performance tuning for oversized tables stored in SequoiaDB. SequoiaDB is a new-generation distributed OLTP database that is widely used for massive data storage with highly concurrent operations. For storing and concurrently operating on huge volumes of data, a distributed database has a natural advantage over a traditional database, and making sensible use of SequoiaDB's features makes the performance problems of oversized tables much easier to solve.

02

Data storage planning matters

For large transaction-record tables, up-front data storage planning is particularly important. A sound storage plan makes efficient use of the cluster's hardware resources and delivers higher-performance, more efficient data services.

1. Cluster size and hardware configuration

When first planning a database cluster, thoroughly research the scale of the applications the cluster must support, the positioning of the system, and the long-term business development plan. From these, derive a reasonable cluster size along with the CPU, memory, disk, and network card configuration of each server.

Accurately assessing the size of a database cluster is a large and complex piece of work, and the assessment needs business data to support it. Because business requirements change quickly and growth usually exceeds expectations, a small cluster can generally be sized at 1.5 to 2 times the figures gathered from business research, and a large cluster at 1 to 1.5 times.

Cluster size should be evaluated from three kinds of information: business scale, data storage scale, and the three-year data size of the largest transaction-record table. Business scale is researched from the number of business visits, the proportion of each type of database operation, and the distribution of operations over time. It yields the TPS of each type of operation and the concurrency of data operations.

Data storage scale is assessed mainly for transaction-record data. Research the existing data, the estimated three-year increment, and data throughput to arrive at the cluster's storage size and throughput, then combine these with the TPS to estimate disk IOPS. Basic-information tables have little effect on cluster sizing, so only their overall size is needed as a reference. For example, if the cluster must store 1,000 basic tables averaging 0.1 GB each, they need 100 GB in total, which is negligible in a cluster sized in the hundreds of terabytes.
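The sizing rule above can be written down as a small calculation. The following is only a sketch: the function name, the parameters, and the sample figures for existing and incremental data are hypothetical, and the formula (existing data plus three years of growth, multiplied by the headroom factor, plus the basic tables) is one plausible reading of the text rather than an official SequoiaDB sizing method.

```python
def estimate_storage_gb(stock_gb, yearly_increment_gb, years=3,
                        headroom=1.5, base_tables=0, avg_base_table_gb=0.0):
    """Rough storage estimate for transaction-record data plus basic tables.

    headroom is the multiplier from the text: about 1.5 to 2 for a small
    cluster, about 1 to 1.5 for a large one.
    """
    if headroom < 1.0:
        raise ValueError("headroom should be at least 1.0")
    # Transaction-record data: what exists today plus the expected growth.
    flow_gb = stock_gb + yearly_increment_gb * years
    # Basic tables only contribute their overall size.
    base_gb = base_tables * avg_base_table_gb
    return flow_gb * headroom + base_gb


# Hypothetical inputs: 100 TB existing, 50 TB of new data per year,
# plus the 1,000 basic tables of 0.1 GB each from the example above.
total_gb = estimate_storage_gb(100_000, 50_000, years=3, headroom=1.5,
                               base_tables=1000, avg_base_table_gb=0.1)
print(f"estimated storage: {total_gb / 1000:.1f} TB")
```

With these made-up inputs the basic tables add about 100 GB to roughly 375 TB, which illustrates why the text treats them as negligible.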

The final cluster size is settled from the specific figures to be stored. Choose between mechanical hard disks and solid-state drives according to IOPS and data throughput, decide the capacity of each disk and how many disks each server mounts, and configure CPU and memory according to TPS, concurrency, and the mix of operation types. In general, a CPU-to-memory ratio of 1:8 is recommended, and single-disk capacity should mostly be 1.5 to 3 TB, because the larger a single disk is, the more easily it becomes an I/O bottleneck. For the network, choose Gigabit or higher-bandwidth Ethernet according to cluster size and cluster throughput.

2. How to build a better transaction-record table

Transaction-record data typically has two characteristic dimensions: a business-time dimension and a natural-key dimension. The most common design for a transaction-record table is a multi-dimensional partitioned table (Figure 1). First, the main collection splits the data by business date into different sub-collections, so that no single collection grows too large. Then, within each sub-collection, the data is hashed by natural key across the data nodes of the cluster. The amount of data held by one collection on one node is best kept at the million-record level or below. In addition, because the table is partitioned by business date, scaling the cluster out becomes very simple.
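The two-level routing described above can be sketched in a few lines. This is an illustration of the idea only: the node names, the monthly sub-collection naming, and the use of MD5 are assumptions made for the example, and SequoiaDB's own partitioning and hash function are not reproduced here.

```python
import hashlib
from datetime import date

NODES = ["node1", "node2", "node3"]  # hypothetical data nodes


def sub_collection_for(business_date: date) -> str:
    """First dimension: one sub-collection per business month (assumed granularity)."""
    return f"flow_{business_date:%Y%m}"


def node_for(natural_key: str, nodes=NODES) -> str:
    """Second dimension: hash the natural key to pick a data node."""
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def route(business_date: date, natural_key: str):
    """Return (sub-collection, node) for one transaction record."""
    return sub_collection_for(business_date), node_for(natural_key)


print(route(date(2020, 2, 29), "ACC-0001"))
```

Because the date decides the sub-collection and the key decides the node, a record's location is fully determined by those two fields, which is what later makes date-restricted queries cheap.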

Note that when designing the collection spaces for a transaction-record table, it is best to create one collection space per year or per month. When data passes its retention period, it can then be backed up quickly, deleted, and its storage space released.

[Figure 1: multi-dimensional partitioned table]

3. Even sound planning is not once and for all

However sound the storage plan, it can look powerless against rapid data growth and ever-changing business requirements. Data growth always exceeds expectations, and sooner or later a performance bottleneck appears. At that point, performance must be optimized against the specific problem.

03

Hardware resource performance bottleneck analysis

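Cluster hardware resource usage is an important basis for analyzing performance bottlenecks. From changes in resource usage you can see directly when the cluster's behavior changed, relate that moment to business or operational changes, and work out a line of analysis from there.

1. Hardware resource monitoring data is the foundation

Routine performance monitoring of each hardware resource on the cluster servers is important. The routine monitoring data shows how cluster performance fluctuates.

nmon is a well-known open-source system performance monitoring tool. It monitors resource consumption on AIX and Linux systems and can write the results to a file, from which the nmon_analyser tool produces data files and charts. Figure 2 shows the nmon main screen. Many tutorials on the Internet cover installing nmon, real-time monitoring, data collection, and report generation, for readers who want to learn more.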

[Figure 2: nmon main screen]

2. Multi-dimensional analysis of cluster hardware resource usage

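Hardware resource usage analysis divides into real-time monitoring analysis and analysis of monitoring report data. The two need to be used together.

For a persistent performance problem, first use real-time monitoring to see how much of each hardware resource specific processes use. Then use report analysis to widen the monitored time window and find the point where performance started to change. Check whether the change is periodic, random, or gradual, and locate the problem by relating it to production changes and data operations around that time.

For a performance problem that has occurred but has not yet recurred, first use monitoring reports to analyze how hardware resources changed around the time of the problem, then widen the time window and try to find a pattern in the changes. Next, use application logs, database logs, and production change records to find the operation that triggered the problem. Finally, try re-running that operation under real-time monitoring to verify the hypothesis.

Finding the underlying logic of performance changes in the hardware monitoring data is the key to problem analysis. Once you understand how hardware resources vary, you have a first grasp of the cluster's performance, and that is the first step of performance tuning.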

04

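Database cluster workload and operation analysis

Understanding hardware resource performance is the first step of optimization. To locate a performance problem more precisely, it must be combined with the cluster's business workload and the proportion and time distribution of each type of operation.

1. Assessing the business workload

The business workload is usually assessed by continuously collecting access statistics through a business access monitoring interface and, together with the business logic, analyzing which database operations must be executed. This yields business TPS, concurrency, the distribution of data operation types, data throughput, and similar information, from which the pressure on the database cluster is evaluated.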

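2. Data distribution analysis

Performance problems caused by uneven data distribution are very common. In an oversized transaction-record table, uneven distribution mainly arises because the business primary key field is not random enough to hash evenly. Another, more random field can be used in its place, or, depending on the characteristics of the data, range partitioning is also worth trying.

How do you determine whether data is evenly distributed? First, look at the catalog information of the main collection and check that the business-date split is balanced. Then look at the catalog information of each sub-collection and check that the data is hashed by business primary key across every node of its data domain. Finally, spot-check two or three sub-collections and check that the amount of data on each node is reasonably balanced.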
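The final spot check, comparing record counts per node, can be expressed as a simple skew ratio. The sketch below assumes the per-node counts have already been collected by some other means. The function names and the 1.2 tolerance are illustrative choices, not values from SequoiaDB or from this article.

```python
def skew_ratio(counts):
    """Largest per-node record count divided by the mean count (1.0 = perfectly even)."""
    values = list(counts.values())
    if not values or sum(values) == 0:
        return 0.0
    mean = sum(values) / len(values)
    return max(values) / mean


def is_balanced(counts, tolerance=1.2):
    """Treat the sub-collection as balanced if no node exceeds the mean by more than 20%."""
    return skew_ratio(counts) <= tolerance


# Hypothetical record counts of one sub-collection on three data nodes.
even = {"node1": 980_000, "node2": 1_010_000, "node3": 1_010_000}
skewed = {"node1": 2_700_000, "node2": 150_000, "node3": 150_000}
print(is_balanced(even), is_balanced(skewed))
```

A skewed result on several sub-collections points at the hash field itself, which is the case where the text suggests switching to a more random field.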

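3. Access plan analysis

query.explain({ Detail : true, Run : true }) returns the detailed access plan of a query. Analyzing the access plan of a data operation shows which sub-collections it touches and the access plan on each node of each sub-collection, including the scan type, index usage, data statistics, and resource usage.

4. Query monitoring and full table scan detection

Taking a session snapshot of the database cluster with db.snapshot(SDB_SNAP_SESSIONS) captures the details of each query's data operations, including session status, thread information, index and data read information, number of records operated on, elapsed time, and resource usage.

By taking session snapshots repeatedly and checking whether the "LastOpInfo" field contains the string "tbscan", you can detect whether any data operation performs a full table scan.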
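The check on "LastOpInfo" amounts to a string filter over snapshot records. The sketch below works on records that have already been fetched and converted to Python dictionaries. How they are fetched is left out, and the sample records, including the "SessionID" key and the text inside "LastOpInfo", are invented for illustration and do not show SequoiaDB's real output format.

```python
def find_table_scans(sessions):
    """Return the session records whose LastOpInfo mentions 'tbscan'."""
    return [s for s in sessions if "tbscan" in str(s.get("LastOpInfo", ""))]


# Made-up snapshot records; only the LastOpInfo field name comes from the text.
sample = [
    {"SessionID": 11, "LastOpInfo": "query on flow.t202001 using ixscan"},
    {"SessionID": 12, "LastOpInfo": "query on flow.t202002 using tbscan"},
    {"SessionID": 13},  # a session with no recorded operation
]

for session in find_table_scans(sample):
    print("possible full table scan in session", session["SessionID"])
```

Running such a filter on a schedule and logging the hits gives a running list of statements that still need an index or a tighter condition.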

05

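Performance optimization guidelines

Optimizing an oversized transaction-record table generally proceeds from operations to software to hardware, from simple to complex. Before optimizing, you need a full picture of the software and hardware, such as the data operations, the database configuration, and system performance.

1. Create a moderate number of efficient indexes

An index is a special object that improves data access efficiency. Used properly, indexes speed up data retrieval. Used improperly, they slow it down and, in serious cases, degrade the service performance of the whole database.

Index tuning is the most economical, simple, and efficient way to tune an oversized transaction-record table. So how do you create just the right indexes, ones that improve operation efficiency without hurting database service performance?

First, you need to know the index characteristics that are relevant to tuning, and the associated performance costs.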

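Relevant characteristics:

Lookups use binary search, which locates data quickly; the average complexity is O(log N).

Index data is much smaller than table data, so it can stay cached in memory for a long time.

Indexes use a B-tree structure. A three-level B-tree can represent millions of records, and retrieving data through an index reduces the number of disk I/Os. The smaller the indexed field, the more efficient the index.

The number of I/Os depends on the height H of the B-tree. If the table holds N records and each disk block holds M data items, then H = log_{M+1}(N). For a given N, the larger M is, the smaller H is. M equals the disk block size divided by the data item size. The disk block size is the size of one data page and is fixed, so the smaller each data item, the more items fit in a block and the lower the tree. This is why each data item, that is, each indexed field, should be as small as possible: an int takes 4 bytes, half of the 8 bytes of a bigint. It is also why indexes keep the actual data in leaf nodes rather than inner nodes (the B+ tree layout): data stored in inner nodes would sharply reduce the number of items per block and make the tree taller. In the extreme case where a block holds only one data item, the fan-out advantage is lost and, in the worst case, the tree degenerates into a linear list.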
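To make the height formula concrete, the sketch below computes the smallest H for which (M+1)^H reaches N, using integer arithmetic to avoid floating-point rounding in the logarithm. The 4096-byte page and the assumption that a data item is exactly as large as its key are simplifications chosen for the example. A real index entry also carries pointers and headers.

```python
def btree_height(n_records, items_per_block):
    """Smallest H with (M + 1) ** H >= N, i.e. the ceiling of log_{M+1}(N)."""
    if n_records < 1 or items_per_block < 1:
        raise ValueError("n_records and items_per_block must be positive")
    base = items_per_block + 1
    height, capacity = 1, base
    while capacity < n_records:
        capacity *= base
        height += 1
    return height


PAGE_BYTES = 4096        # assumed page size
N = 1_000_000_000        # one billion records

print(btree_height(N, PAGE_BYTES // 4))  # 4-byte int keys    -> 3
print(btree_height(N, PAGE_BYTES // 8))  # 8-byte bigint keys -> 4
print(btree_height(N, 1))                # one item per block -> 30
```

Under these assumptions, halving the key size removes a whole level from the tree, and therefore one disk read from every lookup that misses the cache.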


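Indexes follow the leftmost-prefix matching rule, so composite indexes should be created according to data duplication rates and the fields that queries actually use.

The data items of a B-tree are composite structures, and the search tree is built by comparing fields from left to right. For example, suppose the index is on (name, age, gender). When a search key such as (Xiao Zhang, 22, female) arrives, the B-tree first compares name to decide the search direction, and if the names are equal it compares age and then gender, finally reaching the data. But when a key without name, such as (22, female), arrives, the B-tree does not know which node to visit next, because name was the first comparison factor when the tree was built. It must search by name to know where to go. And when a key such as (Xiao Zhang, male) arrives, the B-tree can use name to choose the direction, but the next field, age, is missing, so it can only find all entries whose name is "Xiao Zhang" and then filter them for gender "male".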
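The three cases in the example can be checked with a short helper that returns the leading index fields a query is able to use. It is a simplified model of leftmost-prefix matching for equality conditions only. The function name is hypothetical and no real query optimizer is involved.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that can drive the B-tree search."""
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break  # the first missing field ends the usable prefix
        prefix.append(field)
    return prefix


INDEX = ("name", "age", "gender")

print(usable_prefix(INDEX, {"name", "age", "gender"}))  # all three fields
print(usable_prefix(INDEX, {"age", "gender"}))          # nothing: name is missing
print(usable_prefix(INDEX, {"name", "gender"}))         # name only, gender is filtered afterwards
```

Fields that fall outside the returned prefix can still be tested against each candidate entry, but they no longer narrow the part of the tree that has to be read.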

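Index data is ordered and supports ascending and descending order. Used well, this speeds up sorting and grouping in queries.

The lower the data duplication rate, the more efficient the index.

Performance costs:

Every data operation must also maintain the indexes, so the more indexes there are, the higher the maintenance cost.

Creating an index requires sorting and extra storage space, acquires a table lock, and consumes a lot of memory and CPU during creation.

A data query causes at least 2 I/O operations: one to read the index data and one to access the table data.

A data insert causes at least 1+n I/O operations (n is the number of indexes). When an index leaf node overflows, it triggers recursive node splits and I/O rises sharply.

A data delete causes at least 3+n I/O operations (n is the number of indexes). When an index leaf node becomes too empty, it triggers recursive node merges and I/O rises sharply.

A data update causes 3 I/O operations if the updated field is not indexed, and 3+n I/O operations (n is the number of indexes) if it is. Index node splits and merges may both occur (for variable-length field types).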
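The minimum I/O counts listed above can be combined into a rough cost model for comparing index counts against a given workload mix. This sketch encodes only the lower bounds stated in the text. It ignores caching and node splits or merges, and the workload numbers are hypothetical.

```python
def min_io(operation, n_indexes, updates_indexed_field=False):
    """Minimum I/O operations for one data operation, per the rules above."""
    if operation == "query":
        return 2                      # one index read + one table read
    if operation == "insert":
        return 1 + n_indexes
    if operation == "delete":
        return 3 + n_indexes
    if operation == "update":
        return 3 + n_indexes if updates_indexed_field else 3
    raise ValueError(f"unknown operation: {operation}")


def workload_io(mix, n_indexes):
    """Total minimum I/O for a mix such as {'insert': 100, 'query': 50}."""
    return sum(count * min_io(op, n_indexes) for op, count in mix.items())


# Hypothetical mix per second: insert-heavy, as is typical for transaction records.
mix = {"insert": 1000, "query": 200, "update": 50}
for n in (1, 3, 6):
    print(n, "indexes ->", workload_io(mix, n), "I/O operations at minimum")
```

For an insert-heavy mix the total grows almost linearly with the number of indexes, which is the reasoning behind keeping the index count small.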

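With the index characteristics and costs in mind, and given the nature of oversized transaction-record tables, evaluate the ratio of inserts, updates, and queries and investigate whether updated fields overlap with queried fields. Use this to decide whether indexes are needed and which ones to create.

In general, every oversized transaction-record table needs an index on the transaction primary key to guarantee that records are unique. Other indexes should be evaluated from the update and query conditions, data size, and so on. Two or three indexes usually give the best balance of efficiency and performance. If the workload is mostly queries, the number of indexes can be increased as appropriate. For query conditions made up of several fields, create a composite index based on the duplication rates of the condition fields. For fields used in sorting or grouping, create a composite index on the sort or group fields.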

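2. Optimize query statements

Query optimization for an oversized transaction-record table has two basic principles: first, tune indexes so that queries use the intended index; second, restrict the business-date range of the query to reduce the amount of data searched.

For index tuning, create indexes as described in "Create a moderate number of efficient indexes" above, and check the execution plan to make sure the query uses the index.

Making good use of SequoiaDB's main/sub-collection design and fixing the business-date dimension at query time greatly reduces the amount of data searched and saves cluster resources. For example, suppose an application looks up a transaction record by its serial number (the primary key), and the serial number embeds the business date. Normally the serial number alone finds the record quickly and accurately. But an oversized transaction-record table may hold ten years or decades of data spread over hundreds of sub-collections. If only the serial number is used, the database must run an index lookup on every sub-collection, and only one of them returns data. If the business date is extracted from the serial number and added as a query condition, the database can use the main collection's business-date partition information to locate the sub-collection that holds the record, and it runs the index lookup on that sub-collection only.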
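The serial-number example can be sketched as a helper that derives the business date and adds it to the query condition. The serial number layout (business date in the first eight characters as YYYYMMDD) and the field names serial_no and biz_date are assumptions for illustration. The article does not specify either.

```python
from datetime import datetime


def build_condition(serial_no):
    """Build a query condition that carries both the key and its business date.

    Assumes the first 8 characters of the serial number are YYYYMMDD.
    Raises ValueError if they do not form a valid date.
    """
    biz_date = datetime.strptime(serial_no[:8], "%Y%m%d").date()
    return {"serial_no": serial_no, "biz_date": biz_date.isoformat()}


print(build_condition("20200315000012345"))
```

Passing a condition of this shape instead of the bare serial number is what lets the main collection route the lookup to a single sub-collection.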

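In addition, a query over a wide business-date range that needs paging can query one business-date partition at a time and complete in several passes. Queries with very large result sets should normally be paged, with a page size of no more than a few hundred records.

Overall, there is no fixed, standard procedure for tuning query statements. It is a gradual process: only by repeatedly testing and optimizing can query efficiency be raised step by step.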

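3. Adjust the data partitioning granularity

Deciding when an oversized transaction-record table needs a different partitioning granularity is probably the hardest question in this section. It requires weighing cluster hardware resource usage, the state of query optimization, the data size and index size of one sub-collection on one node, and more.

From the hardware point of view, disk usage should be below 70% (above 70%, consider expanding the cluster directly). The signal to look for is that cluster CPU and memory stay mostly below 60% while only IOPS remains high, I/O reads are strained while I/O writes are normal, and query TPS is far below IOPS. This means a single query causes a great many I/O reads. From the query optimization point of view, no query does a full table scan and the statements are already well optimized, yet queries still fail to meet requirements. Evaluate the data size of one sub-collection on one node, which should be kept at the million-record level. The number of fields and the length of each record also affect retrieval, and updates in particular get slower as records gain fields and length. Evaluate the business-date range in the query statements: the more precise the queries, the more room there is to adjust the granularity. Finally, from the field types of the indexes and the actual index sizes, evaluate whether the indexes can be cached entirely in memory, and as far as possible adjust the granularity until they can.

In effect, adjusting the partitioning granularity of an oversized transaction-record table raises data concurrency and shrinks the index of each sub-collection on each node so that the whole index can stay resident in memory, minimizing I/O. As a consequence, a finer granularity increases memory and CPU consumption.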
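Whether an index can stay resident in memory can be estimated with back-of-the-envelope arithmetic. The sketch below is a rough model only: the 16-byte per-entry overhead, the cache budget, and the function names are assumptions. Actual index sizes should be read from the database rather than computed this way.

```python
def index_size_bytes(records, key_bytes, entry_overhead_bytes=16):
    """Approximate size of one index: records * (key + assumed per-entry overhead)."""
    return records * (key_bytes + entry_overhead_bytes)


def fits_in_cache(records, key_bytes, n_indexes, cache_bytes,
                  entry_overhead_bytes=16):
    """True if all indexes of one sub-collection on one node fit in the cache budget."""
    total = n_indexes * index_size_bytes(records, key_bytes, entry_overhead_bytes)
    return total <= cache_bytes


CACHE = 64 * 1024 * 1024  # hypothetical 64 MiB budget per sub-collection per node

print(fits_in_cache(1_000_000, 8, 3, CACHE))  # one sub-collection as is
print(fits_in_cache(500_000, 8, 3, CACHE))    # after splitting it in two
```

In this made-up case the three indexes exceed the budget at one million records per node and fit once the sub-collection is split in two, which is the effect the adjustment aims for.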

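How to adjust the granularity is the other question. For a table built with the main/sub-collection rules described in this article, the adjustment itself is simple: split each business-date range of the main collection into several ranges, that is, split the data of one sub-collection across several sub-collections. The key question is how many. Splitting one sub-collection into two or three is generally reasonable, but the most reliable approach is to test under simulated production load.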
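Splitting one business-date range into several contiguous ranges is straightforward to compute. The sketch below assumes half-open ranges, with the start date included and the end date excluded, and splits as evenly as whole days allow. Attaching the resulting ranges to the main collection is not shown.

```python
from datetime import date, timedelta


def split_range(start, end, parts):
    """Split the half-open date range [start, end) into `parts` contiguous ranges."""
    total_days = (end - start).days
    if parts < 1 or total_days < parts:
        raise ValueError("need at least one day per part")
    bounds = [start + timedelta(days=total_days * i // parts)
              for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# One monthly sub-collection split into two, as the text suggests.
for low, high in split_range(date(2020, 1, 1), date(2020, 2, 1), 2):
    print(low, "->", high)
```

Each returned pair would become the bounds of one new sub-collection, and because the ranges share their boundaries no date is left uncovered or covered twice.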

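4. Upgrade bottleneck hardware

Because SequoiaDB provides disaster recovery and high availability by design, taking a server offline to upgrade its bottleneck hardware is very simple: switch the primary database nodes on that server over to other servers, stop the database service, and shut the server down for the upgrade. If a disk is being replaced, mount the new disk after the reboot, create the corresponding data directories, and restart the database service. The database then resynchronizes the data automatically.

The hardware bottleneck analysis shows clearly which resources are the bottleneck. Upgrade them as the business requires, for example by adding memory, replacing disks with solid-state drives, or even replacing the whole server.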

5. Cluster expansion

When the cluster runs short of storage space, or when the optimization methods above do not resolve the performance bottleneck, proceed with cluster expansion.

Use the existing cluster as a guide when expanding. For example, if the original data domain has three servers and incremental data is surging, the next data domain may need to be planned with six or more servers. As another example, size the new servers from the resource usage of the old ones: lower the specification of resources that sit idle and raise that of resources that are the bottleneck. With an existing cluster to refer to, assessing the scale and hardware configuration of the expansion is much simpler.

Expanding a cluster for an oversized transaction-record table is simple. Install the SequoiaDB software on the new servers and create data nodes according to the specification, add the new nodes to the existing cluster as a new data domain, create the sub-collections for subsequent business dates in the new data domain, and attach those sub-collections to the main collection according to the specification. That completes the expansion. For example, when the hardware resources of data domain 1 are no longer sufficient, a data domain 2 can be added to store the transaction data after 2030.
[Figure: data domain 2 added to store transaction data after 2030]

06

Summary

Optimizing the performance of an oversized transaction-record table is a continuous, iterative process. It must rest on a full understanding of the business layer, the data layer, and the hardware resource layer of the database cluster, and it locates performance problems through hardware resource analysis, cluster operation analysis and detection, and business analysis in turn.

For SequoiaDB users facing large-table optimization, making sensible use of SequoiaDB's multi-dimensional partitioning and easy scale-out mechanism makes large-table performance problems much easier to solve.


Origin blog.51cto.com/13722387/2474282