Introduction to Lucene Dynamic Sharding

In a recently developed search engine, the index needs to be sharded. Based on the needs of the project, we provide two sharding methods. This blog records the process.


Hash algorithm

The principle is very simple: the hash value of the row key (_id) determines which shard a row lives in, and the operation is then performed on that shard.

For example, there is now an index, and 5 shards are initialized, namely shard0, shard1, shard2, shard3, and shard4.
Now you need to save a row of data whose _id is 0001000000123. The HashCode of the _id is 1571574097, and the remainder modulo 5 (1571574097 % 5) is 2, which determines that the data should be saved in shard2. Here is a simple illustration:

(Illustration: locating shard2 from the hash of the _id)
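
A minimal sketch of this lookup in Java (the HashSharding class and shardFor method are illustrative names, not part of Lucene, and the exact hash function is an assumption):

```java
// Minimal sketch of hash-based shard selection (illustrative, not a Lucene API).
// Assumption: the row key is hashed with String.hashCode(); any stable hash works,
// as long as the shard is chosen by hash modulo the shard count.
public final class HashSharding {

    /** Returns the shard index for a row key given a fixed number of shards. */
    public static int shardFor(String id, int numShards) {
        int hash = id.hashCode();
        // floorMod keeps the result non-negative even if hashCode() is negative
        return Math.floorMod(hash, numShards);
    }

    public static void main(String[] args) {
        // In the post's example the hash is 1571574097, and 1571574097 % 5 == 2 -> shard2
        System.out.println(Math.floorMod(1571574097, 5)); // prints 2
        System.out.println(shardFor("0001000000123", 5));
    }
}
```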


The sharding implementation of the Hash algorithm is very simple: knowing the number of shards is enough to locate the data. But precisely because the number of shards is part of the algorithm, changing the number of shards is very expensive.


One solution is to rearrange the data. For example, to grow from M shards to N shards, first split each shard into N small shards, and then merge the corresponding small shards into the N new large shards. The illustration below (copied from the web) shows the idea:

(Illustration: rearranging data from M shards into N shards)

The advantage of this approach is that the number of new shards can be arbitrarily set. The downside is that all the data needs to be rearranged, which can be time-consuming if the amount of data is large.
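
As a rough sketch, assuming a shard can be treated as a plain collection of row keys for the purpose of illustration, the rearrangement amounts to rehashing every key (the Reshard class below is hypothetical):

```java
// Illustrative sketch of rearranging M shards into N shards by rehashing every row key.
import java.util.ArrayList;
import java.util.List;

public final class Reshard {

    /** Rehashes every row key from the old shards into n new shards. */
    public static List<List<String>> rearrange(List<List<String>> oldShards, int n) {
        List<List<String>> newShards = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            newShards.add(new ArrayList<>());
        }
        // Every row is touched once: split each old shard into n small shards and
        // merge the i-th small shard of every old shard into new shard i.
        for (List<String> shard : oldShards) {
            for (String id : shard) {
                newShards.get(Math.floorMod(id.hashCode(), n)).add(id);
            }
        }
        return newShards;
    }
}
```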


Of course, since the growth of the project's data is unpredictable, we did not choose the above method of adding shards, but chose a different one.


Dynamic sharding

It combines the Hash algorithm with a binary tree to increase the number of shards dynamically.


First, the Hash algorithm works the same as before. When creating an index, you can set an initial number of shards, for example 5 shards named shard_0, shard_1, shard_2, shard_3, and shard_4. When adding data, the hash value of the _id determines which shard the data is saved to. The difference is that we set a maximum number of rows for each shard. When the number of rows in a shard reaches this maximum, the shard is split into two smaller shards, which become the current shard's sub-shards.

For example, set the maximum number of rows in a shard to 10 million. When shard_2 exceeds 10 million rows, it is split into two sub-shards, shard_2_2 and shard_2_7. If the data in shard_2_2 then grows to 10 million rows, it is in turn split into the sub-shards shard_2_2_2 and shard_2_2_12.


As can be seen from the example, the split names are not random. Assuming the initial number of shards is m, the splitting rule for shard n is:

shard_n is split into shard_n_n and shard_n_(n + m * 1)

shard_n_n is split into shard_n_n_n and shard_n_n_(n + m * 2)

shard_n_(n + m * 1) is split into shard_n_(n + m * 1)_(n + m * 1) and shard_n_(n + m * 1)_(n + m * 1 + m * 2)

...



The above formulas look complicated, but the idea is simple: a shard whose suffix is r at a layer with cardinality C splits into two sub-shards with suffixes r and r + C, and the cardinality of the next layer is 2 * C. The diagram below illustrates the splitting process:

(Illustration: the shard splitting process as a binary tree)


If this is still unclear, we can sort out the idea by using an _id to find its shard, continuing the example above.

A row of data needs to be saved whose _id is 0001000000123. The HashCode of the _id is 1571574097, and the remainder modulo 5 (1571574097 % 5) is 2, so the data belongs under shard_2.

shard_2 has already been split into two sub-shards, shard_2_2 and shard_2_7. The cardinality of this layer is 10 (cardinality = cardinality of the previous layer * 2), and 1571574097 modulo 10 (1571574097 % 10) is 7, so the data is saved in shard_2_7.

shard_2_7 has no sub-shards, which means it has not been split, so the data is stored directly in this shard.


To summarize, the lookup works as follows (a code sketch follows the list):

  1. Find the shard according to the hash value;

  2. If that shard has sub-shards, continue the lookup among its sub-shards;

  3. If the shard has no sub-shards, store the data in that shard.
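
A sketch of this lookup as a small tree walk; the ShardNode class and its fields are assumptions made for illustration, not a Lucene API:

```java
// Illustrative sketch of the lookup procedure. Each node knows the modulus (cardinality)
// of its own layer; a split node keeps its two sub-shards keyed by their suffix.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class ShardNode {
    final String name;                                         // e.g. "shard_2" or "shard_2_7"
    final int cardinality;                                      // modulus of this layer: 5, 10, 20, ...
    final Map<Integer, ShardNode> children = new HashMap<>();   // empty if the shard is not split

    ShardNode(String name, int cardinality) {
        this.name = name;
        this.cardinality = cardinality;
    }

    /** Rule 1: pick the root shard by hash modulo the initial shard count, then walk down. */
    static ShardNode locate(List<ShardNode> roots, int hash) {
        return roots.get(Math.floorMod(hash, roots.size())).walk(hash);
    }

    /** Rules 2 and 3: recurse into sub-shards until an unsplit shard is reached. */
    private ShardNode walk(int hash) {
        if (children.isEmpty()) {
            return this;                                        // not split: the data is stored here
        }
        int nextCardinality = cardinality * 2;                  // the cardinality doubles per layer
        int suffix = Math.floorMod(hash, nextCardinality);
        return children.get(suffix).walk(hash);
    }
}
```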


Let's analyze the shard splitting rules. Why is shard_1 split into shard_1_1 and shard_1_6?

The reason is very simple. shard_1 holds the rows whose hash value leaves a remainder of 1 when divided by 5. When shard_1 is split into two, the cardinality of the second layer becomes 10 (the cardinality of the previous layer * 2, i.e. 5 * 2). A value that leaves a remainder of 1 modulo 5 can only leave a remainder of 1 or 6 modulo 10, so

shard_1 is split into shard_1_1 and shard_1_6.
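
A small sketch of this naming rule (the SplitRule class and subShardNames method are hypothetical): a shard whose suffix is r at a layer with cardinality C splits into suffixes r and r + C.

```java
// Illustrative sketch of the split-naming rule: a shard whose suffix is r at a layer
// with cardinality C splits into sub-shards with suffixes r and r + C (the next layer
// has cardinality 2 * C, so those are the only two possible remainders).
public final class SplitRule {

    /** Derives the two sub-shard names from the parent's name, suffix and layer cardinality. */
    static String[] subShardNames(String parentName, int parentSuffix, int parentCardinality) {
        int low = parentSuffix;                              // keeps the old remainder
        int high = parentSuffix + parentCardinality;         // the other remainder modulo 2 * C
        return new String[] { parentName + "_" + low, parentName + "_" + high };
    }

    public static void main(String[] args) {
        // shard_1 at the first layer (cardinality 5) -> shard_1_1 and shard_1_6
        System.out.println(String.join(", ", subShardNames("shard_1", 1, 5)));
        // shard_2_7 at the second layer (cardinality 10) -> shard_2_7_7 and shard_2_7_17
        System.out.println(String.join(", ", subShardNames("shard_2_7", 7, 10)));
    }
}
```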


Data consistency

Dynamic sharding splits shards automatically during use, and the splitting process can take a long time. In our tests, splitting an index of 32 columns and 5 million rows into two sub-shards took 245 seconds. If the original data is modified during the split, those modifications may be lost. Therefore, certain measures are needed to keep the data safe while splitting.


The first method is to use pessimistic locking.

    The shard is locked before the split and cannot be modified until the split is complete.

    Advantages: The logic is simple and crude, and the development difficulty is low.

    Disadvantage: Holding the lock for too long may cause a large number of requests from calling services to fail.


The second method is to use the transaction log.

    A transaction log is created before the split, and all insert, update, and delete operations on the current shard are written to it during the split. After the split is complete, lock the shard and its sub-shards, restore the data from the transaction log into the sub-shards, and then unlock.

    Advantages: The shard is locked only while the transaction log is created and while its data is restored, so the locking time is short and service callers are hardly affected.

    Disadvantages: Development is more difficult, since a set of transaction logs and a log recovery interface need to be built. However, the underlying Lucene storage already has a set of transaction log interfaces and implementations, so this drawback can almost be ignored.
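
A rough sketch of the order of operations; the Shard and TransactionLog interfaces below are hypothetical stand-ins for whatever the storage layer actually provides, not real Lucene APIs:

```java
// Hedged sketch of splitting a shard under a transaction log. Shard and TransactionLog
// are hypothetical interfaces used only to show the order of operations.
public final class SplitWithLog {

    interface Shard {
        void lock();
        void unlock();
    }

    interface TransactionLog {
        void replayInto(Shard left, Shard right);   // re-apply writes recorded during the split
    }

    static void split(Shard parent, Shard left, Shard right, TransactionLog log) {
        // 1. Before the split starts, writes to `parent` are also appended to `log`
        //    (the wiring of that append path is omitted here).
        // 2. Long-running step: distribute the existing rows into the two sub-shards.
        copyExistingData(parent, left, right);
        // 3. Short lock: block writes, replay the logged writes into the sub-shards, unlock.
        parent.lock();
        left.lock();
        right.lock();
        try {
            log.replayInto(left, right);
        } finally {
            right.unlock();
            left.unlock();
            parent.unlock();
        }
    }

    private static void copyExistingData(Shard parent, Shard left, Shard right) {
        // the actual re-hashing of existing rows into the sub-shards would happen here
    }
}
```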




Row key incremental sharding

    If the row keys of the saved data increase monotonically overall, for example row keys in the format 000000001, 000000002, 000000003, ..., you can shard by row key. This sharding method is relatively simple to implement:


    1. Set up an initial shard when creating the index;

    2. While adding data, record the minimum value minId and the maximum value maxId of the row keys in each shard;

    3. When the amount of data in a shard exceeds the set maximum, create a new shard and save new data into it;

    4. When updating data, determine the shard a row belongs to by comparing its row key with the minId and maxId of each shard (see the sketch below).
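
A compact sketch of this bookkeeping; the IncrementalSharding and ShardRange classes are illustrative, not part of Lucene:

```java
// Illustrative sketch of row-key incremental sharding: new rows go to the newest shard,
// each shard records the min/max row key it holds, and updates are routed by range.
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public final class IncrementalSharding {

    static final class ShardRange {
        final String name;
        String minId;          // smallest row key stored in this shard
        String maxId;          // largest row key stored in this shard
        long rows;
        ShardRange(String name) { this.name = name; }
    }

    private final List<ShardRange> shards = new ArrayList<>();
    private final long maxRowsPerShard;

    IncrementalSharding(long maxRowsPerShard) {
        this.maxRowsPerShard = maxRowsPerShard;
        shards.add(new ShardRange("shard_0"));               // step 1: the initial shard
    }

    /** Steps 2 and 3: new rows go to the newest shard; a full shard triggers a new one. */
    ShardRange shardForInsert(String id) {
        ShardRange last = shards.get(shards.size() - 1);
        if (last.rows >= maxRowsPerShard) {
            last = new ShardRange("shard_" + shards.size());
            shards.add(last);
        }
        if (last.minId == null) last.minId = id;
        last.maxId = id;                                      // row keys arrive in increasing order
        last.rows++;
        return last;
    }

    /** Step 4: updates locate their shard by comparing the row key with each shard's range. */
    Optional<ShardRange> shardForUpdate(String id) {
        return shards.stream()
                .filter(s -> s.minId != null
                        && s.minId.compareTo(id) <= 0
                        && id.compareTo(s.maxId) <= 0)
                .findFirst();
    }
}
```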


Comparison of row key incremental sharding and Hash algorithm sharding:

    1. The row key incremental sharding method is simpler to implement, and the development cost is lower;

    2. Row key incremental sharding locates shards by minId and maxId, so the metadata of each shard must record its minId and maxId;

    3. Row key incremental sharding requires data to be written roughly in row-key order; otherwise the data may become skewed;

    4. Row key incremental sharding adds shards as needed; only the maximum number of rows per shard has to be set, and there is no splitting process;

    5. Row key incremental sharding concentrates write pressure on the newest shard, while Hash algorithm sharding spreads the pressure across all shards. In theory, Hash algorithm sharding can support higher throughput.






















