MongoDB range sharding and hash sharding

Digression

A few days ago, someone asked me how MongoDB's hash sharding is implemented. When I went to answer, I found my memory had gone a little fuzzy, so I put together this short article as a record.

1. Introduction to MongoDB Sharding

(1) Sharding principle
MongoDB shards data at the collection level: the documents in a collection are split into multiple parts by the shard key. Sharding is how MongoDB distributes a large collection across different servers or clusters. Although sharding has its roots in relational database partitioning, MongoDB sharding is another matter entirely.

Compared with MySQL partitioning schemes, the biggest difference in MongoDB is that it does almost everything automatically: once you tell MongoDB to distribute the data, it keeps the data balanced across servers on its own.

(2) Why shard
Database applications with high data volume and throughput put heavy pressure on a single machine: a large query load exhausts the CPU, and a large data set strains storage, exhausts system memory, and pushes the pressure onto disk I/O.
In order to solve these problems, there are two basic approaches: vertical scaling and horizontal scaling.

Vertical scaling: add more CPU and storage resources to a single machine.
Horizontal scaling: distribute the data set across multiple servers. Horizontal scaling is sharding.

Sharding provides a way to cope with high throughput and large data volumes: each shard handles only a fraction of the total requests.
Through horizontal scaling, the cluster can therefore grow its storage capacity and throughput. For example, when inserting a document, the application only needs to access the shard that stores it, and each shard stores only part of the data.

(3) Shard key
When sharding a collection, you must choose a shard key. The shard key is a single-field or compound index field that must exist in every document. MongoDB splits the data into chunks according to the shard key and distributes the chunks evenly across all shards. To split the chunks by shard key, MongoDB uses either range-based sharding or hash-based sharding.
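As a minimal sketch of what this looks like in practice (using pymongo; the database mydb, collection users, and field user_id are hypothetical, and the client must point at a mongos of an already configured sharded cluster):

```python
from pymongo import MongoClient

# Connect to a mongos query router (address is an assumption).
client = MongoClient("mongodb://localhost:27017")

# Enable sharding for the database, then shard the collection
# on a single-field shard key (user_id is a hypothetical field
# that exists in every document).
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.users",
                     key={"user_id": 1})
```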

(4) Data balancing
When the data in the cluster becomes unbalanced, the balancer migrates chunks from the shard with the most chunks to the shard with the fewest. For example, if the users collection has 100 chunks on shard 1 and 50 chunks on shard 2, the balancer migrates chunks from shard 1 to shard 2 until the data is balanced.
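To watch this from the outside, you can ask a mongos for the balancer's state; balancerStatus is a standard admin command, and the connection details are again an assumption:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # a mongos router

# Reports whether the balancer is enabled ("mode") and whether a
# balancing round (chunk migration) is currently in progress.
status = client.admin.command("balancerStatus")
print(status["mode"], status["inBalancerRound"])
```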

2. The Two Kinds of Sharding

Range-based sharding

With range-based sharding, MongoDB splits the data into different parts according to ranges of the shard key.
Suppose the shard key is numeric: imagine a straight line from negative infinity to positive infinity, with each document's shard key value plotted as a point on the line. MongoDB divides this line into shorter, non-overlapping segments called chunks, and each chunk contains the data whose shard key falls within a certain range.

In a system that uses range-based sharding, documents with "similar" shard keys are likely to be stored in the same chunk, and therefore in the same shard.
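Assuming mydb.users was range-sharded on {"user_id": 1} as in the earlier sketch, a query over a contiguous key range can be routed to just the shards holding those chunks (the values here are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # a mongos router

# user_id values in [1000, 2000) live in a small number of adjacent
# chunks, so mongos targets only the shards that hold those chunks.
for doc in client.mydb.users.find({"user_id": {"$gte": 1000, "$lt": 2000}}):
    print(doc)
```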

Hash-based sharding

With hash-based sharding, MongoDB computes a hash of a field's value and uses these hash values to create the chunks.

In a system that uses hash-based sharding, documents with "similar" shard keys are unlikely to be stored in the same chunk, so the data is spread out more evenly.
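Only the shard-key specification changes; "hashed" is MongoDB's actual keyword, while the database and collection names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # a mongos router
client.admin.command("enableSharding", "mydb")

# Hash-based sharding: chunks cover ranges of the *hash* of user_id,
# so consecutive user_id values scatter into different chunks.
client.admin.command("shardCollection", "mydb.events",
                     key={"user_id": "hashed"})
```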

3. Summary

Range-based sharding supports efficient range queries: given a shard-key range, the query router can easily determine which chunks hold the requested data and forward the request to just those shards.

However, range-based sharding can leave the data unevenly distributed across shards, and sometimes this downside outweighs the benefit to query performance. For example, if the shard-key field increases monotonically, all writes during any given period fall into one fixed chunk, which ends up on a single shard. A small number of shards then carries most of the cluster's data, and the system cannot scale well.

In contrast, hash-based sharding guarantees balanced data across the cluster, at the cost of range-query performance. The randomness of the hash values distributes documents randomly across chunks, and therefore randomly across shards. But because of that randomness, a range query can rarely determine which shards to target; in general it has to query all shards in order to return the required results.
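This trade-off is easy to see in a toy simulation (plain Python; the bucket boundaries and md5 here merely stand in for MongoDB's real chunk boundaries and internal shard-key hash):

```python
import hashlib
from collections import Counter

NUM_CHUNKS = 4
keys = range(10_000, 10_100)  # monotonically increasing shard-key values

# Range-style bucketing with fixed toy boundaries: every new key
# falls into the last bucket, creating a hotspot.
range_counts = Counter(min(k // 2500, NUM_CHUNKS - 1) for k in keys)

# Hash-style bucketing: hashing first spreads the same keys out
# (md5 is only a stand-in for MongoDB's internal hash).
def hash_bucket(k: int) -> int:
    return int(hashlib.md5(str(k).encode()).hexdigest(), 16) % NUM_CHUNKS

hash_counts = Counter(hash_bucket(k) for k in keys)

print("range buckets:", dict(range_counts))  # all 100 keys in bucket 3
print("hash buckets: ", dict(hash_counts))   # roughly even spread
```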

Not bad, right? ------ Corrections and better approaches are welcome.

Original article: blog.csdn.net/Tah_001/article/details/108690215