Database and table sharding: schemes and technology selection (Part 1): sharding algorithms

This article describes the algorithms behind database and table sharding schemes, that is, the rules by which data is divided. It compares several rules step by step against a typical situation; I recommend the last one, Scheme 5. A later chapter covers technology selection and the problems that sharding introduces.

Background

As business volume grows, so does data volume. A single table can hold a lot of data: a table with several million rows can still cope if you optimize the SQL and upgrade the machine. For the long term, however, you should shard at some point, for example when the database becomes a performance bottleneck or when adding a column takes a long time. Sharding removes the situation where one node bears the pressure of all the data. Data is spread across multiple nodes, which provides fault tolerance, so one node going down no longer makes the whole system inaccessible.

Purpose

The sharding schemes in this article are all based on horizontal partitioning. The article walks through different partitioning rules and their advantages and disadvantages. The one I consider best is Scheme 5.

  • Scheme 1: key modulo a gradually increasing divisor
  • Scheme 2: partition by time
  • Scheme 3: partition by numeric range
  • Scheme 4: consistent hashing idea, even distribution
  • Scheme 5: consistent hashing idea, adding nodes iteratively (the one I consider best)

Scheme selection


Scheme 1: key modulo a gradually increasing divisor

Formula: key mod x (x is a natural number)

The key can be the primary key, the order number, or the user id. Decide according to the scenario: use whichever field is queried most often.

Advantages:

  • Databases and tables are added on demand, gradually
  • Even distribution: the shards differ little in size

Disadvantages:

  • You usually start with 2 databases and grow incrementally to 3, 4, 5. When the divisor changes, for example from mod 3 to mod 5, most keys get a different result. With key = 3, 3 mod 3 = 0 but 3 mod 5 = 3, so the row moves from table 0 to table 3. A large share of the data therefore has to change position.
  • Data is migrated repeatedly: with 2 shards, row A is in table 0; with 3 shards, A moves to table 1; with 4 shards, A moves back to table 0 (key = 4 behaves this way).
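To make the migration cost concrete, here is a minimal sketch that counts how many keys change shard when the divisor changes. The class and method names are made up for illustration, and keys are assumed to be non-negative integers.

```java
final class ModuloResharding {
    /** Counts keys in [0, keyCount) whose shard changes when the divisor goes from oldX to newX. */
    static int movedCount(int oldX, int newX, int keyCount) {
        int moved = 0;
        for (int key = 0; key < keyCount; key++) {
            if (key % oldX != key % newX) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Keys 0..14 cover one full cycle of the pair (key mod 3, key mod 5).
        System.out.println(movedCount(3, 5, 15) + " of 15 keys change shard");
    }
}
```

Going from mod 3 to mod 5, only keys 0, 1 and 2 keep their shard in each cycle of 15, so 12 of 15 keys have to move.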

Scheme 2: partition by time

Tables can be split by day, month, or quarter.

tb_20190101
tb_20190102
tb_20190103
……

This scheme requires the order number or userId to carry a date or timestamp, or the query interface to pass the date, so that the right shard can be located.
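As a rough sketch of locating a daily shard, the following derives the table name from a creation date. The tb_ prefix and the yyyyMMdd suffix follow the table names listed above; the class and method names are assumptions for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class TimeShardRouter {
    // Matches the tb_20190101 naming used above.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Returns the daily table name for a record created on the given date. */
    static String tableFor(LocalDate createdOn) {
        return "tb_" + createdOn.format(DAY);
    }
}
```

A caller that only has an order number would first need to recover the creation date, which is exactly the weakness discussed in the disadvantages of this scheme.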

Advantages:

  • Data is contiguous in time
  • Data growth is easy to see

Disadvantages:

  • Historical data created before sharding did not follow the sharding rule. Its order numbers do not necessarily contain a timestamp; they may be auto-incremented or produced by a custom distributed primary-key algorithm. As a result, upstream systems must pass two fields when querying: the order number and the creation time.
  • If the upstream system does not pass the creation time, or the order was created on a different day upstream than in the current system, the current system has to record the time itself. Because the upstream system passes only the order number, the current system needs a master table that maps order numbers to creation times. Every query must check the master table first and then the specific shard table, which costs performance.
  • Distribution is not necessarily even: data growth differs from month to month, so some months hold much more data than others.

Recommended use case: logging


Scheme 3: partition by numeric range

Table 0 [0,10000000)
Table 1 [10000000,20000000)
Table 2 [20000000,30000000)
Table 3 [30000000,40000000)
……

Advantages:

  • Even distribution

Disadvantages:

  • Because the maximum value is unknown, a timestamp cannot be used as the key. This scheme also cannot rely on each table's auto-increment primary key, because the auto-increment counters of separate tables are not maintained centrally. You therefore need an ID-issuing service, or the local system has to maintain one unified auto-incrementing key.
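The range rule listed above reduces to an integer division. A minimal sketch, assuming 10 million keys per table as in the listing and a non-negative key issued by some central generator (the class and method names are hypothetical):

```java
final class RangeShardRouter {
    static final long ROWS_PER_TABLE = 10_000_000L;

    /** Table n holds keys in [n * 10_000_000, (n + 1) * 10_000_000). */
    static long tableFor(long key) {
        if (key < 0) {
            throw new IllegalArgumentException("key must be non-negative");
        }
        return key / ROWS_PER_TABLE;
    }
}
```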

Before the recommended schemes: a brief look at consistent hashing

First, a few words on consistent hashing. Some articles call it an algorithm; I think it is not a specific formula but a set of ideas.

1. Consistent hashing presupposes a ring-shaped space with a fixed maximum and minimum, joined end to end to form a closed loop, for example the minimum and maximum of int or long. Many articles assume 2^32 positions, with a maximum of 2^32 - 1 and a minimum of 0, that is, the number space 0 to 2^32 - 1. They are only following the usual hash-algorithm example; real sharding does not have to use this number. This is why I regard consistent hashing as a concept rather than a concrete formula.

[Figure: the hash ring]

2. Design a formula value = hash(key) that has a maximum and a minimum. For example, with key mod 64 = value, the maximum is 63 and the minimum is 0. The data then falls onto the ring.

3. Place the nodes. A node's position can be obtained by hashing its IP, or by choosing a fixed value (the later schemes use fixed values). Starting from a node and going counterclockwise until the previous node, all data whose value = hash(key) falls in that stretch is managed by this node. For example, if hash(node1) = 10, data with hash(key) from 0 to 10 is managed by node1.

Summary

The theory is not described in detail here. Its main point is that the maximum and minimum are fixed and never change. Later on, you only move nodes or add nodes, which reduces the amount of data each node manages and so reduces the pressure on it.

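Notes:
* Hashing the IP is not recommended, because hash(ip) may come out large, for example 60. If there is no other node in front of it, this node then has to manage most of the data.
* Keys are best generated with the Snowflake algorithm. At the very least they should be non-repeating numbers, and they should not be auto-incremented.
* Recommended reading: the Tongbanjie (铜板街) scheme, which appends user % 64 to the end of the order number.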
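The three steps above can be sketched with a sorted map. This is only an illustration of the idea under stated assumptions: hash(key) = key mod 64 as in step 2, node positions are fixed values chosen by hand, and the class and method names are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;

final class HashRing {
    static final int RING_SIZE = 64; // hash(key) = key mod 64, so values are 0..63

    // Key: node position on the ring; value: node name.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();

    void addNode(int position, String name) {
        nodes.put(position, name);
    }

    /** Returns the first node at or after hash(key), wrapping around to the first node. */
    String ownerOf(long key) {
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        int value = (int) Math.floorMod(key, (long) RING_SIZE);
        Map.Entry<Integer, String> entry = nodes.ceilingEntry(value);
        return (entry != null ? entry : nodes.firstEntry()).getValue();
    }
}
```

With a node at position 10, keys whose hash is 0 to 10 resolve to that node, which matches the node1 example in step 3. Adding a node only changes the owner of the values between the new node and the node before it.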

Scheme 4: consistent hashing idea, even distribution

Following the consistent hashing idea, the database is chosen with hash(key) defined as value = key mod 64, and the table with value = key / 64 mod 64. The key is a frequently queried field such as the order number or userId. (These formulas are modified later.) With these formulas the data can be split into 64 databases with 64 tables each. Assuming 10 million rows per table, the maximum is 64 * 64 * 10 million rows. I doubt that amount will ever be reached, so it is a reasonable maximum; even 32 * 32 would do.
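A small sketch of the two formulas and the resulting capacity, assuming a non-negative numeric key. The class and method names are hypothetical.

```java
final class DbTableRouter {
    static final int DB_COUNT = 64;
    static final int TABLES_PER_DB = 64;

    /** Database index: key mod 64. */
    static int dbIndex(long key) {
        return (int) (key % DB_COUNT);
    }

    /** Table index inside the database: key / 64 mod 64 (integer division). */
    static int tableIndex(long key) {
        return (int) (key / DB_COUNT % TABLES_PER_DB);
    }

    /** Total capacity for a given number of rows per table. */
    static long maxRows(long rowsPerTable) {
        return (long) DB_COUNT * TABLES_PER_DB * rowsPerTable;
    }
}
```

With 10 million rows per table, the capacity works out to 64 * 64 * 10,000,000 = 40,960,000,000 rows.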

That many tables are not needed early on. Creating them all from the start, so that every table receives a little data, wastes machines. Since the maximum is known, we start with a small number of shards by grouping the values computed above.

Grouping formula: 64 = count (number of values per group) * group (number of groups)
Position of the data on the ring (that is, which database it is in): value = key mod 64 / count * count

The following example uses 16 groups, that is, 16 databases: group = 16 and count = 4, so the database formula is value = key mod 64 / 4 * 4. The division by 4 is an integer division that drops the fractional part; multiplying by 4 again gives the position of the data.

// Groups of 4 values, which gives 16 databases
int count = 4;
Integer dbValue = userId % 64 / count * count;

hash(key) in 0~3 goes to database 0
hash(key) in 4~7 goes to database 4
hash(key) in 8~11 goes to database 8
……

Note: you can actually start with count = 64, which is a single database, and later change count to 32 for two databases. Go from one database to two, then to four, growing step by step.
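The grouping rule can be tried out with a small helper. This is a sketch, assuming a non-negative userId and a count that divides 64; the class and method names are made up for illustration.

```java
final class GroupedRouter {
    static final int RING_SIZE = 64;

    /** value = key mod 64 / count * count, using integer division. */
    static int dbValue(long userId, int count) {
        if (count <= 0 || RING_SIZE % count != 0) {
            throw new IllegalArgumentException("count must be a positive divisor of 64");
        }
        int hashValue = (int) (userId % RING_SIZE);
        return hashValue / count * count;
    }

    public static void main(String[] args) {
        // The same user lands in different databases as count shrinks from 64 to 16.
        for (int count : new int[] {64, 32, 16}) {
            System.out.println("count=" + count + " -> database " + dbValue(40, count));
        }
    }
}
```

For userId 40 the result is database 0 with count = 64 and database 32 with count = 32 or 16, which shows how the growth from one database to two to four reuses the same formula.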

Expansion iterations, starting from a single database:

[Figure: expansion from 16 groups to 32 groups]

In the figure's example, when 16 groups become 32 groups, every database has to hand over half of its data to a new database. Expansion can continue this way up to 64 groups.

As you can see, each expansion doubles the number of nodes (2^n growth) and has to migrate half of the data, so the scope of impact is relatively large.
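The "half of the data" claim can be checked by comparing the grouping formula before and after count is halved. A minimal sketch over the 64 ring positions, with hypothetical names:

```java
final class DoublingMigration {
    static final int RING_SIZE = 64;

    static int dbValue(int hashValue, int count) {
        return hashValue / count * count;
    }

    /** Counts ring positions (0..63) whose database changes when count goes from oldCount to newCount. */
    static int movedSlots(int oldCount, int newCount) {
        int moved = 0;
        for (int h = 0; h < RING_SIZE; h++) {
            if (dbValue(h, oldCount) != dbValue(h, newCount)) {
                moved++;
            }
        }
        return moved;
    }
}
```

Each doubling, whether from 1 to 2 databases or from 16 to 32, changes the database of 32 of the 64 positions, which is the upper half of every old group.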

Advantages:

  • If you split straight into 32 groups, the job is done once and for all
  • If the data volume is large and you do not want to create too many tables, the once-and-for-all approach is available
  • Even distribution
  • Unlike Scheme 1, migration does not move most of the data and does not move the same data repeatedly; only half has to be migrated

Disadvantages:

  • It can be expanded, but the scope of impact is large.
  • The amount of migrated data is large. It is not most of the data as in Scheme 1, but every table or database still has to migrate half of its data.
  • To do it once and for all, the system has to be shut down for the whole data migration

Scheme 5: consistent hashing idea, adding nodes iteratively

(the one I consider best)

This scheme combines consistent hashing with ranges, that is, it combines Scheme 3 and Scheme 4.

Scheme 4 fixes the maximum range at 64 and grows the number of databases or tables exponentially as 2^n. Each expansion therefore splits off and migrates 1/2 of all the data, which has a relatively large impact. You either split straight into 32 or 64 groups once and for all, or you migrate 1/2 each time.

My idea is to keep the consistent hashing concept but add nodes one at a time, rather than in the 2^n steps of Scheme 4. In exchange, the code has to check the hash value to decide which data goes to the new node.


The demo of the following iterations builds on the state after iteration 1, in which the data has already been split into two databases. First, look at that state:

The data falls into database No. 64, named db64, and database No. 32, named db32.

[Figure: two databases after iteration 1]

Iteration 2: unlike Scheme 4, which adds two nodes at once, we add only one node. The migration then affects only 1/4 of users instead of 1/2.

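// Groups of 32 values, which gives two databases
int count = 32;
int hashValue = userId % 64;
Integer dbValue = hashValue / count * count;
if (hashValue < 16) {
    // In the previous iteration this data was in db32; it now goes to the newly added node, db16.
    // The check is on the hash value, because dbValue itself can only be 0 or 32 here.
    dbValue = 16;
    return dbValue;
} else {
    // Follow the original rule
    return dbValue;
}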

Iteration 3:

In this way, the migration that Scheme 4 performs in one step is completed over several iterations.

The code can go live before the migration, so that only a quarter of users are affected.

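// Logic added to the request interface.
// migrationFinished is a placeholder name for the "has the migration finished?" switch.
public void doSomeService(Integer userId) {
    if (!migrationFinished) {
        // The migration has not finished yet
        int hashValue = userId % 64;
        if (hashValue < 16) {
            // These users cannot go through the logic below for now
            return;
        }
    }
    // ... normal business logic ...
}

// When sharding: groups of 32 values, which gives two databases
int count = 32;
int hashValue = userId % 64;
Integer dbValue = hashValue / count * count;
if (hashValue < 16) {
    // In the previous iteration this data was in db32; this half of db32 goes to the new node, db16
    if (migrationFinished) {
        // Once the migration has finished, use db16
        dbValue = 16;
    }
    return dbValue;
} else {
    // Follow the original rule
    return dbValue;
}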

By analogy, in the next round, with eight nodes in total, each step only needs to migrate 1/8 of the data.

The first iteration does not have to use the threshold 32 either. It could start with a threshold of 8 (dbValue < 8), so that the scope of impact of the first iteration is smaller.
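The iterations above can be simulated with a sorted map in which each node owns the ring positions below its own number, so db32 owns 0..31 and db64 owns 32..63. This is a sketch under that assumption, not the author's implementation; the class and method names are hypothetical, while db16, db32 and db64 are the names used in the text.

```java
import java.util.Map;
import java.util.TreeMap;

final class IterativeRing {
    static final int RING_SIZE = 64;

    // Key: exclusive upper bound of the node's range on the ring; value: database name.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();

    void addNode(int upperBound, String dbName) {
        nodes.put(upperBound, dbName);
    }

    /** Returns the database whose range contains userId mod 64. */
    String ownerOf(long userId) {
        int hashValue = (int) Math.floorMod(userId, (long) RING_SIZE);
        Map.Entry<Integer, String> entry = nodes.higherEntry(hashValue);
        if (entry == null) {
            throw new IllegalStateException("the ring needs a node with upper bound 64");
        }
        return entry.getValue();
    }

    /** Counts ring positions whose owner differs between two ring layouts. */
    static int movedSlots(IterativeRing before, IterativeRing after) {
        int moved = 0;
        for (int h = 0; h < RING_SIZE; h++) {
            if (!before.ownerOf(h).equals(after.ownerOf(h))) {
                moved++;
            }
        }
        return moved;
    }
}
```

Starting from db32 and db64 and adding only db16 changes the owner of 16 of the 64 positions, which is the 1/4 of users mentioned for iteration 2. Adding a node with upper bound 8 to a four-node ring would change 8 positions, or 1/8.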

Advantages:

  • Easy to expand
  • Nodes are added gradually as the data grows
  • Only a small number of users are affected
  • Iterating reduces risk
  • Migration time is shorter, in the spirit of agile iteration

Disadvantages:

  • Data is unevenly distributed for a period of time


Origin juejin.im/post/5d6b8dbef265da03f47c38df