MyCat sharding algorithm learning (repost)

Reprinted from Mycat Sharding Algorithm Learning

1 Shard enumeration

1.1 Official Documentation

Sharding is configured manually by listing the possible enumeration ids in a configuration file.
This rule applies to specific scenarios. For example, some businesses need to store data by province or by district/county, and the set of provinces, districts and counties nationwide is fixed; that type of business can use this rule. The configuration is as follows:

<tableRule name="sharding-by-intfile">
    <rule>
        <columns>user_id</columns>
        <algorithm>hash-int</algorithm>
    </rule>
</tableRule>
<function name="hash-int" class="org.opencloudb.route.function.PartitionByFileMap">
    <property name="mapFile">partition-hash-int.txt</property>
    <property name="type">0</property>
    <property name="defaultNode">0</property>
</function>

partition-hash-int.txt configuration:

10000=0
10010=1
DEFAULT_NODE=1

 

The columns element above identifies the table field to be sharded, and algorithm the sharding function.
In the sharding function configuration, mapFile is the name of the map file. type defaults to 0: 0 means the values are Integers, non-zero means Strings. All node numbering starts from 0.

defaultNode, the default node: a value less than 0 means no default node is set; a value greater than or equal to 0 sets that node as the default.
The role of the default node: when sharding by enumeration, any unrecognized enumeration value is routed to the default node.
If no default node is configured, an unrecognized enumeration value causes an error to be reported.
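
To make the lookup concrete, here is a minimal Java sketch (under the assumptions above, not the actual PartitionByFileMap source) of how an enumeration map like partition-hash-int.txt resolves a column value to a node, with a defaultNode fallback; the class and method names are illustrative only.

// Minimal sketch of enumeration-based sharding with a default-node fallback.
import java.util.HashMap;
import java.util.Map;

public class EnumShardSketch {
    private final Map<String, Integer> nodeByValue = new HashMap<>();
    private final int defaultNode; // -1 means "no default node configured"

    public EnumShardSketch(int defaultNode) {
        this.defaultNode = defaultNode;
        // entries as they would appear in partition-hash-int.txt
        nodeByValue.put("10000", 0);
        nodeByValue.put("10010", 1);
    }

    public int calculate(String columnValue) {
        Integer node = nodeByValue.get(columnValue);
        if (node != null) return node;
        if (defaultNode >= 0) return defaultNode;   // unknown values go to the default node
        throw new IllegalArgumentException("no shard for value " + columnValue); // no default configured
    }

    public static void main(String[] args) {
        EnumShardSketch rule = new EnumShardSketch(1);
        System.out.println(rule.calculate("10000")); // 0
        System.out.println(rule.calculate("99999")); // 1 (falls back to the default node)
    }
}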

2 Fixed shard hash algorithm

2.1 Official Documentation

This rule is similar to taking a decimal modulo, except that it is a bitwise operation: it takes the low 10 bits of the id's binary representation (why?).
The advantage over a decimal modulo is that with a decimal modulo, inserting ids 1-10 consecutively scatters the rows across shards 1-10, which makes transaction control for the insert harder; because this algorithm works on the binary value, consecutive values may land in the same shard, reducing the difficulty of transaction control.

<tableRule name="rule1">
    <rule>
       <columns>user_id</columns>
       <algorithm>func1</algorithm>
    </rule>
</tableRule>

<function name="fuc1" class="org.opencloudb.route.function.PartitionByLong">
    <property name="partitionCount">2,1</property>
    <property name="partitionLength">256,512</property>
</function>

Configuration description:
partitionCount is the list of partition counts, and partitionLength is the list of partition lengths.
Partition length: the default maximum is 1024, i.e. at most 1024 slots are supported.
The constraint is 1024 = sum(count[i] * length[i]); in other words, the dot product of the count and length vectors must always equal 1024.
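
As a sanity check on the description, here is a hedged Java sketch of the idea for the partitionCount=2,1 / partitionLength=256,512 example: the low 10 bits of the id pick a slot in 0..1023, and the slot is mapped to a partition by segment length. It illustrates the arithmetic only and is not the PartitionByLong source.

// Sketch: slot = low 10 bits of id, mapped to a partition by cumulative segment length.
public class FixedHashSketch {
    private final int[] counts = {2, 1};     // partitionCount
    private final int[] lengths = {256, 512}; // partitionLength

    public int calculate(long id) {
        int slot = (int) (id & 1023); // lower 10 bits, i.e. id % 1024
        int partition = 0;
        int covered = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int c = 0; c < counts[i]; c++) {
                covered += lengths[i];
                if (slot < covered) return partition;
                partition++;
            }
        }
        return partition - 1; // unreachable when sum(count*length) == 1024
    }

    public static void main(String[] args) {
        FixedHashSketch rule = new FixedHashSketch();
        System.out.println(rule.calculate(100)); // 0  (slot 100 < 256)
        System.out.println(rule.calculate(300)); // 1  (256 <= slot 300 < 512)
        System.out.println(rule.calculate(600)); // 2  (slot 600 >= 512)
    }
}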

2.2 Personal Notes

In the example above, the table is divided into 1024 slots: the first two partitions take 256 slots each and the third partition takes the remaining 512.
The behaviour on string columns is unclear; with this configuration the result seems to depend only on the last and second-to-last characters of the value.
It effectively only works on purely numeric columns. The benefit is understandable: it seems applicable mainly to fields like an auto-increment id and, as the documentation says, is suited to continuous inserts.
It remains to be seen where this sharding method is actually applicable.

3 Range convention

3.1 Official Documentation

This rule is suitable when you can plan in advance which shard each range of the sharding field belongs to.
A value matches a range when start <= value <= end.
Each line of the map file has the form "range start-end = data node index", where K = 1000 and M = 10000.

<tableRule name="auto-sharding-long">
    <rule>
        <columns>user_id</columns>
        <algorithm>rang-long</algorithm>
    </rule>
</tableRule>
<function name="rang-long" class="org.opencloudb.route.function.AutoPartitionByLong">
    <property name="mapFile">autopartition-long.txt</property>
    <property name="defaultNode">0</property>
</function>

autopartition-long.txt configuration:

0-500M=0
500M-1000M=1
1000M-1500M=2
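
A small illustrative Java sketch of the range lookup, assuming K=1000 and M=10000 as stated above (so 500M = 5,000,000); the class name and first-match behaviour are assumptions, not the AutoPartitionByLong source.

// Sketch: walk the configured ranges and return the node of the first match.
public class RangeConventionSketch {
    private final long[][] ranges = {          // {start, end, node}; first match wins
        {0,             500L * 10000, 0},
        {500L * 10000,  1000L * 10000, 1},
        {1000L * 10000, 1500L * 10000, 2},
    };
    private final int defaultNode = 0;

    public int calculate(long value) {
        for (long[] r : ranges) {
            if (value >= r[0] && value <= r[1]) return (int) r[2];
        }
        return defaultNode; // value outside all configured ranges
    }

    public static void main(String[] args) {
        RangeConventionSketch rule = new RangeConventionSketch();
        System.out.println(rule.calculate(4_200_000L)); // 0
        System.out.println(rule.calculate(7_000_000L)); // 1
    }
}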

3.2 Personal Notes

Groups adjacent numeric values together. If you can determine which shard table a given row lives in, global sorting can be simplified (needs checking).
**If the value of the sharding column grows beyond its shard's upper limit, will the row be redistributed? (Verification required)**

The sharding column effectively cannot be changed afterwards, which immediately reduces this rule's usefulness.

4 Modulo

4.1 Official Documentation

This rule is the modulo operation on the shard field.

<tableRule name="mod-long">
    <rule>
        <columns>user_id</columns>
        <algorithm>mod-long</algorithm>
    </rule>
</tableRule>
<function name="mod-long" class="org.opencloudb.route.function.PartitionByMod">
    <property name="count">3</property>
</function>

This configuration simply takes the decimal modulo of the id. Compared with the fixed shard hash, a batch insert belonging to a single transaction may be scattered across multiple shards, which increases the difficulty of keeping the transaction consistent.
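
A tiny Java illustration of the mod-long rule with count=3, showing why consecutive ids land on different shards (names here are illustrative):

// Consecutive ids rotate through the nodes, scattering a batch insert.
public class ModLongSketch {
    public static void main(String[] args) {
        int count = 3;                                        // the "count" property above
        for (long id = 1; id <= 6; id++) {
            System.out.println(id + " -> node " + (id % count)); // nodes 1,2,0,1,2,0 in turn
        }
    }
}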

4.2 Personal Notes

What are the advantages and disadvantages here?

5 Shard by date (day)

5.1 Official Documentation

<tableRule name="sharding-by-date">
    <rule>
        <columns>create_time</columns>
        <algorithm>sharding-by-date</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-date" class="org.opencloudb.route.function.PartitionByDate">
    <property name="dateFormat">yyyy-MM-dd</property>121
    <property name="sBeginDate">2014-01-01</property>
    <property name="sEndDate">2014-01-02</property>
    <property name="sPartionDay">10</property>
</function>

sPartionDay is the number of days covered by each shard. If sEndDate is configured, then once the data reaches the shard for that date, insertion cycles back to the first shard.
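
A hedged Java sketch of the day-based calculation described above: the partition is the number of days since sBeginDate divided by sPartionDay, and wrapping at sEndDate is noted in a comment. This illustrates the description only and is not the PartitionByDate source.

// Sketch: daysSince(sBeginDate) / sPartionDay picks the shard.
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DateShardSketch {
    private final LocalDate begin = LocalDate.parse("2014-01-01"); // sBeginDate
    private final int partionDay = 10;                             // sPartionDay
    // If sEndDate is configured, the result would additionally be wrapped modulo the
    // number of partitions between sBeginDate and sEndDate, so insertion cycles back
    // to the first shard.

    public int calculate(String value) {
        long days = ChronoUnit.DAYS.between(begin, LocalDate.parse(value));
        return (int) (days / partionDay);
    }

    public static void main(String[] args) {
        DateShardSketch rule = new DateShardSketch();
        System.out.println(rule.calculate("2014-01-05")); // 0 (days 0-9)
        System.out.println(rule.calculate("2014-01-15")); // 1 (days 10-19)
    }
}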

5.2 Personal Notes

The documentation here is unclear; this needs to be tried out in detail.

6 Modulo range constraints

6.1 Official Documentation

This rule combines the modulo operation with a range constraint. It is mainly intended to prepare for later data migration, i.e. it lets you decide independently how the modulo results are distributed across nodes.

<tableRule name="sharding-by-pattern">
    <rule>
        <columns>user_id</columns>
        <algorithm>sharding-by-pattern</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-pattern" class="org.opencloudb.route.function.PartitionByPattern">
    <property name="patternValue">256</property>
    <property name="defaultNode">2</property>
    <property name="mapFile">partition-pattern.txt</property>
</function>

partition-pattern.txt configuration:

# id partition range start-end ,data node index
###### first host configuration
1-32=0
33-64=1
65-96=2
97-128=3
###### second host configuration
129-160=4
161-192=5
193-224=6
225-256=7
0-0=7

The columns element above identifies the table field to be sharded, and algorithm the sharding function; patternValue is the modulo base and defaultNode is the default node. If a default node is configured, ids that cannot take part in the modulo operation (i.e. non-numeric values) are routed to it instead.

In the map file, 1-32 represents a range of results of id % 256: results falling in 1-32 go to the first partition, and so on. If the id is not numeric, it is allocated to the defaultNode default node.
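
A hedged Java sketch of the flow just described (id % patternValue, then a range lookup, with non-numeric ids going to the default node); the class name is illustrative and this is not the PartitionByPattern source.

// Sketch: modulo picks a slot, the range file maps the slot to a node.
public class PatternModSketch {
    private final int patternValue = 256;
    private final int defaultNode = 2;
    private final int[][] ranges = {          // {start, end, node} from partition-pattern.txt
        {1, 32, 0}, {33, 64, 1}, {65, 96, 2}, {97, 128, 3},
        {129, 160, 4}, {161, 192, 5}, {193, 224, 6}, {225, 256, 7}, {0, 0, 7},
    };

    public int calculate(String columnValue) {
        long id;
        try {
            id = Long.parseLong(columnValue);
        } catch (NumberFormatException e) {
            return defaultNode;               // non-numeric values go to the default node
        }
        long slot = id % patternValue;
        for (int[] r : ranges) {
            if (slot >= r[0] && slot <= r[1]) return r[2];
        }
        return defaultNode;
    }

    public static void main(String[] args) {
        PatternModSketch rule = new PatternModSketch();
        System.out.println(rule.calculate("100")); // 100 % 256 = 100 -> node 3
        System.out.println(rule.calculate("abc")); // -> default node 2
    }
}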

6.2 Personal Notes

The range constraint applied after the modulo solves the problem of having to design shards around the raw value of the sharding column. If the sharding column were something like cumulative sales and the value grew past the previously configured upper limit, a pure range-constraint algorithm would report errors or dump everything into the default shard; this algorithm does not.
What does "preparing for subsequent data migration" mean here? Why would this kind of sharding help with that?

7 Intercept numbers to do hash modulo range constraints

7.1 Official Documentation

This rule is similar to the modulo range constraint, except that it also supports taking the modulo of values containing letters and symbols.

<tableRule name="sharding-by-prefixpattern">
    <rule>
        <columns>user_id</columns>
        <algorithm>sharding-by-prefixpattern</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-pattern" class="org.opencloudb.route.function.PartitionByPrefixPattern">
    <property name="patternValue">256</property>
    <property name="prefixLength">5</property>
    <property name="mapFile">partition-pattern.txt</property>
</function>




partition-pattern.txt configuration:

# range start-end ,data node index
# ASCII
# 48-57=0-9 (Arabic numerals)
# 64=@, 65-90=A-Z
# 97-122=a-z
###### first host configuration
1-4=0
5-8=1
9-12=2
13-16=3
###### second host configuration
17-20=4
21-24=5
25-28=6
29-32=7
0-0=7

 

The columns element above identifies the table field to be sharded, and algorithm the sharding function; patternValue is the modulo base and prefixLength is the number of leading characters whose ASCII codes are used.

This method is similar to rule 6, except that the ASCII codes of the first prefixLength characters of the column value are summed, sum % patternValue is computed, and the result is looked up in the configured ranges to find the shard.
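
A small Java sketch of the prefix computation described above: sum the ASCII codes of the first prefixLength characters and take the result modulo patternValue; the resulting slot would then be looked up in partition-pattern.txt. The sample value and class name are illustrative.

// Sketch: sum of ASCII codes of the prefix, then modulo patternValue.
public class PrefixPatternSketch {
    private final int patternValue = 256;
    private final int prefixLength = 5;

    public int slotFor(String columnValue) {
        int sum = 0;
        int n = Math.min(prefixLength, columnValue.length());
        for (int i = 0; i < n; i++) {
            sum += columnValue.charAt(i);     // ASCII code of each prefix character
        }
        return sum % patternValue;            // this slot is then mapped via the range file
    }

    public static void main(String[] args) {
        PrefixPatternSketch rule = new PrefixPatternSketch();
        System.out.println(rule.slotFor("gf89f9a")); // sum of 'g','f','8','9','f' mod 256 = 164
    }
}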

8 Application-specified sharding

8.1 Official Documentation

With this rule, the application decides at runtime which shard to route to.

<tableRule name="sharding-by-substring">
    <rule>
      <columns>user_id</columns>
        <algorithm>sharding-by-substring</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-substring" class="org.opencloudb.route.function.PartitionDirectBySubString">
    <property name="startIndex">0</property><!-- zero-based -->124
    <property name="size">2</property>
    <property name="partitionCount">8</property>
    <property name="defaultPartition">0</property>
</function>

The columns element above identifies the table field to be sharded, and algorithm the sharding function. This method computes the partition number directly from a substring of the value (which must be numeric); the value is supplied by the application, so the partition number is effectively specified explicitly.
For example, with id = 05-100000002 and this configuration, starting at startIndex=0 and taking size=2 digits gives 05, so 05 is the partition. If no valid value is passed, the row is assigned to defaultPartition.
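
A minimal Java illustration of this rule under the configuration above (startIndex=0, size=2); the class name is made up for the example.

// Sketch: the partition number is embedded in the value itself.
public class SubstringShardSketch {
    private final int startIndex = 0;
    private final int size = 2;
    private final int defaultPartition = 0;

    public int calculate(String columnValue) {
        try {
            return Integer.parseInt(columnValue.substring(startIndex, startIndex + size));
        } catch (RuntimeException e) {
            return defaultPartition;   // value too short or not numeric
        }
    }

    public static void main(String[] args) {
        System.out.println(new SubstringShardSketch().calculate("05-100000002")); // 5
    }
}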

9 Intercept digital hash parsing

9.1 Official Documentation

This rule intercepts a slice of the string, hashes it as an int value, and shards on the result.

<tableRule name="sharding-by-stringhash">
    <rule>
        <columns>user_id</columns>
        <algorithm>sharding-by-stringhash</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-stringhash" class="org.opencloudb.route.function.PartitionByString">
    <property name="partitionLength">512</property><!-- zero-based -->
    <property name="partitionCount">2</property>
    <property name="hashSlice">0:2</property>
</function>
/**
* “2” -> (0,2)
* “1:2” -> (1,2)
* “1:” -> (1,0)
* “-1:” -> (-1,0)
* “:-1” -> (0,-1)
* “:” -> (0,0)
*/
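
The hashSlice comment above describes how a "start:end" spec maps to a (start, end) pair. The following Java sketch reproduces just that parsing behaviour as an illustration; it is not the PartitionByString source.

// Sketch: parse a hashSlice spec such as "2", "1:2", "1:", ":-1" into (start, end).
public class HashSliceSketch {
    public static int[] parse(String spec) {
        if (!spec.contains(":")) {
            return new int[] {0, Integer.parseInt(spec)};              // "2" -> (0,2)
        }
        String[] parts = spec.split(":", -1);
        int start = parts[0].isEmpty() ? 0 : Integer.parseInt(parts[0]); // "1:" -> (1,0)
        int end   = parts[1].isEmpty() ? 0 : Integer.parseInt(parts[1]); // ":-1" -> (0,-1)
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int[] s = parse("0:2");
        System.out.println(s[0] + "," + s[1]); // 0,2 -- the slice of the string that gets hashed
    }
}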

10 Consistent hash

10.1 Official Documentation

Consistent hashing effectively solves the problem of expanding distributed data.

<tableRule name="sharding-by-murmur">
    <rule>
    <columns>user_id</columns>
    <algorithm>murmur</algorithm>
</rule>
</tableRule>
<function name="murmur" class="org.opencloudb.route.function.PartitionByMurmurHash">
    <property name="seed">0</property><!-- 默认是 0-->
    <property name="count">2</property><!-- 要分片的数据库节点数量,必须指定,否则没法分片-->
    <property name="virtualBucketTimes">160</property><!-- 一个实际的数据库节点被映射为这么多虚拟节点,默认是 160 倍,也就是虚拟节点数是物理节点数的 160 倍-->
<!--
<property name="weightMapFile">weightMapFile</property>
节点的权重,没有指定权重的节点默认是 1。以 properties 文件的格式填写,以从 0 开始到 count-1 的整数值也就是节点索引为 key,以节点权重值为值。所有权重值必须是正整数,否则以 1 代替
-->
<!--
<property name="bucketMapPath">/etc/mycat/bucketMapPath</property>
用于测试时观察各物理节点与虚拟节点的分布情况,如果指定了这个属性,会把虚拟节点的 murmur hash 值与物理节点的映射按行输出到这个文件,没有默认值,如果不指定,就不会输出任何东西
-->
</function>

Suppose there are 10 shard nodes, the data volume is 10 million rows, and the primary key auto-increments starting from 1,000,000.
Running the main method of PartitionByMurmurHash under these assumptions gives the following results:

index bucket ratio
 0 1001836 0.1001836
 1 1038892 0.1038892
 2 927886 0.0927886
 3 972728 0.0972728
 4 1086100 0.10861
 5 908616 0.0908616
 6 1024269 0.1024269
 7 1018029 0.1018029
 8 995581 0.0995581
 9 1026063 0.1026063

The first column is the shard node index, the second column is the number of rows hashed to that node, and the third column is the ratio of that node's rows to the total. The third column sums to 1.0 and the second column sums to 10,000,000. If the data volume is very small, the consistent-hash distribution is noticeably uneven, but once there are more than about 10,000 rows the per-node ratio stays around 0.1, and the larger the data volume the closer it gets.
Now suppose two new nodes are added and a rehash is executed on node 0; the following results appear:

 index bucket ratio
 0 853804 0.8522392886660092
 1 0 0.0
 2 0 0.0
 3 0 0.0
 4 0 0.0
 5 0 0.0
 6 0 0.0
 7 0 0.0
 8 0 0.0
 9 0 0.0
 10 70075 0.06994657808264028
 11 77957 0.07781413325135052

The first and second columns mean the same as in the previous list; the third column is the ratio of the rows hashed to each node to the total number of rows originally on node 0. As the list shows, node 0 originally held 1,001,836 rows; after the rehash most of that data is still hashed to node 0, a small amount goes to the two new nodes 10 and 11, and none of the other old nodes receives any of node 0's original data. In fact, no matter how many nodes are added, a rehash always follows this pattern: when an existing node's data is rehashed, it can only end up in one of two places, the node it was on before the rehash or a newly added node. That is precisely the point of consistent hashing.
Using this sharding method ensures that as little data as possible is migrated during a rehash.
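
To illustrate why a rehash only moves data to new nodes, here is a hedged sketch of a consistent-hash ring with virtual nodes. It uses a simple stand-in hash instead of murmur and invented class/method names, so it demonstrates the mechanism only, not MyCat's PartitionByMurmurHash.

// Sketch: virtual nodes on a sorted ring; a key maps to the first virtual node at or after its hash.
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashSketch {
    private final TreeMap<Long, Integer> ring = new TreeMap<>();
    private final int virtualBucketTimes = 160;

    public void addNode(int nodeIndex) {
        for (int v = 0; v < virtualBucketTimes; v++) {
            ring.put(hash(nodeIndex + "#" + v), nodeIndex); // map each virtual node to its physical node
        }
    }

    public int calculate(String key) {
        SortedMap<Long, Integer> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        // stand-in hash; MyCat uses murmur hash here
        long h = 1125899906842597L;
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        ConsistentHashSketch ring = new ConsistentHashSketch();
        for (int i = 0; i < 10; i++) ring.addNode(i);
        System.out.println(ring.calculate("1000001")); // some node in 0..9
        ring.addNode(10); // adding a node only remaps keys whose hash now falls before the new virtual nodes
    }
}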

10.2 Personal Notes

How is the rehash actually performed?

11 Split by hour within a single month

11.1 Official Documentation

This rule splits by hour within a single month; the minimum granularity is one hour. There can be at most 24 shards per day and at least 1. After one month ends, the next month starts again from the first shard.
At the end of each month, the data needs to be cleaned up manually.

<tableRule name="sharding-by-hour">
    <rule>
        <columns>create_time</columns>
        <algorithm>sharding-by-hour</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-hour" class="org.opencloudb.route.function.LatestMonthPartion">
    <property name="splitOneDay">24</property>
</function>

splitOneDay : the number of shards to split in a day

LatestMonthPartion partion = new LatestMonthPartion();
partion.setSplitOneDay(24);
Integer val = partion.calculate("2015020100");
assertTrue(val == 0);
val = partion.calculate("2015020216");
assertTrue(val == 40);
val = partion.calculate("2015022823");
assertTrue(val == 27 * 24 + 23);
Integer[] span = partion.calculateRange("2015020100", "2015022823");
assertTrue(span.length == 27 * 24 + 23 + 1);
assertTrue(span[0] == 0 && span[span.length - 1] == 27 * 24 + 23);
span = partion.calculateRange("2015020100", "2015020123");
assertTrue(span.length == 24);
assertTrue(span[0] == 0 && span[span.length - 1] == 23);

12 Range modulo sharding

12.1 Official Documentation

Range sharding is applied first to determine the shard group, then a modulo is computed within the group.
The advantage is that data migration can be avoided during expansion, and the hotspot problem of pure range sharding is avoided to a certain extent.
It combines the strengths of range sharding and modulo sharding: using a modulo within a shard group keeps the data in the group relatively uniform, while range sharding between groups still supports range queries.
It is best to plan the number of shards in advance and to expand by whole shard groups, so that data in the existing shard groups never needs to be migrated. Because the data within a group is relatively uniform, hotspots inside a group are avoided.

<tableRule name="auto-sharding-rang-mod">
    <rule>
        <columns>id</columns>
        <algorithm>rang-mod</algorithm>
    </rule>
</tableRule>
<function name="rang-mod" class="org.opencloudb.route.function.PartitionByRangeMod">
    <property name="mapFile">partition-range-mod.txt</property>
    <property name="defaultNode">21</property>
</function>

The columns element above identifies the table field to be sharded, and algorithm the sharding function. In the rang-mod function, mapFile is the configuration file path and defaultNode is the default node index used when the value falls outside all configured ranges; node numbering starts from 0.

partition-range-mod.txt configuration:

# range start-end , data node group size
# Each range below represents one shard group; the number after = is the number of shards in that group.
0-200M=5    // 5 shard nodes
200M1-400M=1
400M1-600M=4
600M1-800M=4
800M1-1000M=6
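
A hedged Java sketch of the two-step calculation (the range picks the shard group, the modulo picks the node inside the group), using the ranges above with M = 10000; the names and the out-of-range handling are illustrative.

// Sketch: range -> shard group, then id % groupSize -> node within the group.
public class RangeModSketch {
    // {rangeStart, rangeEnd, groupSize}; node indexes are assigned group by group
    private final long[][] groups = {
        {0,       2000000, 5},
        {2000001, 4000000, 1},
        {4000001, 6000000, 4},
        {6000001, 8000000, 4},
        {8000001, 10000000, 6},
    };

    public int calculate(long id) {
        int nodeOffset = 0;
        for (long[] g : groups) {
            if (id >= g[0] && id <= g[1]) {
                return nodeOffset + (int) (id % g[2]); // modulo inside the shard group
            }
            nodeOffset += (int) g[2];
        }
        return -1; // out of range; the real rule would use defaultNode
    }

    public static void main(String[] args) {
        RangeModSketch rule = new RangeModSketch();
        System.out.println(rule.calculate(1234567L)); // group 0 (nodes 0-4): 1234567 % 5 = 2
    }
}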

13 Date range hash sharding

13.1 Official Documentation

The idea is the same as range modulo sharding, but taking a plain modulo of a date clusters the data, so a hash is used instead: rows are first grouped by date range and then hashed by time within the group, which spreads the data more evenly over short periods.
Advantage: it avoids data migration during expansion and, to some extent, the hotspot problem of range sharding.
The date format should be as precise as possible, otherwise local uniformity will not be achieved.

<tableRule name="rangeDateHash">
    <rule>
        <columns>col_date</columns>
        <algorithm>range-date-hash</algorithm>
    </rule>
</tableRule>
<function name="range-date-hash" class="org.opencloudb.route.function.PartitionByRangeDateHash">
    <property name="sBeginDate">2014-01-01 00:00:00</property>
    <property name="sPartionDay">3</property>
    <property name="dateFormat">yyyy-MM-dd HH:mm:ss</property>
    <property name="groupPartionSize">6</property>
</function>

 

14 Hot and cold data sharding

14.1 Official Documentation

Distributes hot and cold data by date. For example, for log data: queries for the most recent n months go to the real-time transaction database, while data older than n months is split into one shard per m days.

<tableRule name="sharding-by-date">
    <rule>
        <columns>create_time</columns>
        <algorithm>sharding-by-hotdate</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-hotdate" class="org.opencloudb.route.function.PartitionByHotDate">
    <property name="dateFormat">yyyy-MM-dd</property>
    <property name="sLastDay">10</property>
    <property name="sPartionDay">30</property>
</function>

15 Natural month sharding

15.1 Official Documentation

Partitions by a month column, one shard per natural month. The example below shows the date format and how calculate and range (between) operations are resolved.

<tableRule name="sharding-by-month">
    <rule>
        <columns>create_time</columns>
        <algorithm>sharding-by-month</algorithm>
    </rule>
</tableRule>
<function name="sharding-by-month" class="org.opencloudb.route.function.PartitionByMonth">
    <property name="dateFormat">yyyy-MM-dd</property>
    <property name="sBeginDate">2014-01-01</property>
</function>
PartitionByMonth partition = new PartitionByMonth();
partition.setDateFormat("yyyy-MM-dd");
partition.setsBeginDate("2014-01-01");
partition.init();
Assert.assertEquals(true, 0 == partition.calculate("2014-01-01"));
Assert.assertEquals(true, 0 == partition.calculate("2014-01-10"));
Assert.assertEquals(true, 0 == partition.calculate("2014-01-31"));
Assert.assertEquals(true, 1 == partition.calculate("2014-02-01"));
Assert.assertEquals(true, 1 == partition.calculate("2014-02-28"));
Assert.assertEquals(true, 2 == partition.calculate("2014-03-1"));
Assert.assertEquals(true, 11 == partition.calculate("2014-12-31"));
Assert.assertEquals(true, 12 == partition.calculate("2015-01-31"));
Assert.assertEquals(true, 23 == partition.calculate("2015-12-31"));

Secondary development

Develop new partition rules
Create a class that extends AbstractPartitionAlgorithm and implements RuleAlgorithm, overriding public void init() and public Integer calculate(String columnValue).
init() initializes the rule from the sharding parameters specified in rule.xml, and calculate(String) receives the sharding field value as a string and returns the index of the node where the record should be stored.
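
As a template only (the exact packages of AbstractPartitionAlgorithm and RuleAlgorithm depend on the MyCat version and the imports are omitted here), a custom rule might look like this hypothetical parity-based example:

// Skeleton of a custom rule; imports for the MyCat base class and interface omitted
// because their package differs between org.opencloudb.* and io.mycat.* versions.
public class PartitionByParity extends AbstractPartitionAlgorithm implements RuleAlgorithm {

    @Override
    public void init() {
        // read and validate any properties injected from rule.xml here
    }

    @Override
    public Integer calculate(String columnValue) {
        // example rule: even values go to node 0, odd values to node 1
        long id = Long.parseLong(columnValue);
        return (int) (id % 2);
    }
}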

Overall notes

Todo:
- Experiment with the fixed shard hash
- Read the source code of the range convention, mainly how the binary conversion is done, and test the actual sharding effect
- Read the source code of date sharding
- Try out the sEndDate option of date sharding
- Why does consistent hashing solve the expansion problem?
- Why does the within-a-month hourly sharding require manual data cleanup? Where does the cleaned-up data go?

Note:
- After changing the rules, mycat must be restarted for them to take effect, or the change can be applied in mysql
- When using INSERT, the column list must be specified explicitly; it cannot be omitted
- The column used by the fixed shard hash algorithm only supports numbers
- The sharding column cannot be updated
- The algorithm class paths written in the documentation are inconsistent with the actual paths; the new paths start with "io.mycat."
- Date-based sharding does not need an explicit shard count; the number of shards is determined jointly by sBeginDate, sEndDate and sPartionDay: the first two determine the total number of days, and the third switches to a new shard every x days. After sEndDate is reached, allocation starts again from the first shard.
