[MongoDB] Sharding concept and principle

How do we keep reads and writes efficient when data grows to massive scale? Anyone familiar with databases knows about partitioning, whose main purpose is to speed up queries. When the I/O capacity and scalability of a single node become the bottleneck, sharding is the tool of choice: the cluster handles the details transparently, and we get the performance gain without caring about the concrete implementation.

Sharding is not a single technology but a concept: scale data horizontally and break through the I/O limit of a single node. Many mature NoSQL and NewSQL systems ship with built-in sharding implementations; the user simply specifies the field(s) to shard on, and the cluster partitions the data by itself.

Sharding follows the shared-nothing idea: the data in different shards does not interfere with each other. It differs from traditional partitioning, which is implemented inside a single physical database; partitioning narrows the range of rows a SQL statement has to read or write, and thereby shortens data-access time. Sharding, by contrast, spreads data across distributed storage so that queries can be processed in parallel.

A shard is one or more servers responsible for a subset of the data. When a shard consists of multiple servers, each server holds a copy of that subset; the number of copies is usually configured on the replica set. Generally speaking, once the shard key field is defined, there are three ways to shard:

  1. By range of key values

  2. By hash modulo

  3. Custom
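As an illustration, here is a minimal sketch of the first two strategies in Python. The shard count and range boundaries are invented for the example:

```python
import hashlib

# Hypothetical cluster with 4 shards.
NUM_SHARDS = 4

def shard_by_hash(shard_key: str) -> int:
    """Hash-modulo sharding: hash the key, take the result modulo the shard count."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range sharding: each shard owns a half-open interval of key values.
RANGES = [(0, 1000), (1000, 5000), (5000, 20000), (20000, float("inf"))]

def shard_by_range(numeric_key: int) -> int:
    """Range sharding: find the shard whose interval contains the key."""
    for shard_id, (lo, hi) in enumerate(RANGES):
        if lo <= numeric_key < hi:
            return shard_id
    raise ValueError("key outside all ranges")
```

Hash modulo spreads writes evenly but destroys range locality; range sharding keeps adjacent keys together at the cost of potential hot spots.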

Many NoSQL and NewSQL systems have implemented sharding. The following takes MongoDB and MySQL Cluster as examples and introduces them in detail.

MongoDB

After a shard key is specified, the MongoDB cluster partitions the data by itself, sharding according to the amount of data. Within each shard, it divides the data further into intervals ("chunks"), and each chunk stores a different range of shard-key values. The purpose is to make data migration manageable and to avoid the network overhead of moving large amounts of data at once, which would affect normal usage. The default chunk size is 64 MB in versions before MongoDB 6.0 (128 MB since 6.0): if chunks are defined too large, moving them causes heavy network traffic; if too small, managing them brings excessive overhead.
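The per-chunk interval lookup described above can be sketched with a sorted list of split points. The boundaries here are invented for illustration:

```python
import bisect

# Hypothetical split points dividing the shard-key space into chunks.
# Chunk 0 covers (-inf, 100), chunk 1 covers [100, 500), and so on;
# the last chunk covers [10000, +inf).
split_points = [100, 500, 2500, 10000]

def chunk_for_key(key: int) -> int:
    """Return the index of the chunk that owns this shard-key value."""
    return bisect.bisect_right(split_points, key)
```

Because the split points are sorted, routing a key to its chunk is a binary search rather than a scan over all chunks.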

The work of moving data to keep shards balanced is done by a balancer process in MongoDB. It runs automatically, and you don't need to care about the details. The goal of the balancer is not only to even out the load but also to minimize the number of migrations.
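A toy version of the balancing idea: given chunk counts per shard, migrate one chunk at a time from the fullest shard to the emptiest until the spread is within a threshold. This is only a greedy sketch under invented inputs; MongoDB's real balancer is considerably more sophisticated:

```python
def balance(chunk_counts: dict, threshold: int = 1) -> list:
    """Greedy rebalancing: repeatedly move one chunk from the most-loaded
    shard to the least-loaded shard until max - min <= threshold.
    Returns the list of (source, destination) migrations performed,
    so the total migration count is easy to inspect."""
    counts = dict(chunk_counts)  # work on a copy
    moves = []
    while True:
        src = max(counts, key=counts.get)
        dst = min(counts, key=counts.get)
        if counts[src] - counts[dst] <= threshold:
            return moves
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))
```

Moving one chunk at a time from the extremes is what keeps the migration count low: an already-balanced cluster produces no moves at all.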

Users do not access the MongoDB shards directly. Instead, they go through mongos, MongoDB's router, which forwards every request to the corresponding shard; all cluster configuration is stored on the config servers. mongos is a lightweight process and can usually run on any server in the mongo cluster.
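In practice, sharding is set up through mongos with the standard shell helpers. A typical mongosh session looks roughly like the following; the database and collection names are placeholders:

```javascript
// Connect to a mongos router, not to an individual shard:
//   mongosh "mongodb://mongos-host:27017"

// Enable sharding for a database (name is a placeholder).
sh.enableSharding("mydb")

// Shard a collection on a chosen shard key; hashed sharding shown here.
sh.shardCollection("mydb.users", { userId: "hashed" })

// Inspect the shard and chunk distribution.
sh.status()
```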



When using a MongoDB cluster, pay attention to statistics computed over the data. While a chunk migration is in progress, the documents exist on both shards; the copy on the original shard is deleted only after the move completes. Therefore, if an unreasonable shard key causes frequent migrations, counts that simply sum over all shards are very likely to include a lot of duplicated data.
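A small simulation of the over-count, assuming a chunk that is mid-migration exists on both the donor and the recipient shard; a naive per-shard sum then counts its documents twice. The shard names and document IDs are made up:

```python
def naive_count(shards: dict) -> int:
    """Sum document counts shard by shard, ignoring in-flight migrations."""
    return sum(len(docs) for docs in shards.values())

# The chunk holding u5 and u6 is being migrated from shard0 to shard1
# and is momentarily present on both (the donor copy is deleted only
# after the migration commits).
shards = {
    "shard0": ["u1", "u2", "u5", "u6"],
    "shard1": ["u3", "u4", "u5", "u6"],
}

# True number of distinct documents vs. the naive per-shard sum.
distinct = len({doc for docs in shards.values() for doc in docs})
```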

In addition, suppose uniqueness must be enforced on more than one field. We usually index the shard key field and shard on it; for the other field that must be unique, the application has to check whether the value already exists before writing. If two application processes perform that check at the same time, both find no result, both then write, and a duplicate value ends up stored. The watertight solution is to lock the whole cluster before writing, but that severely hurts cluster performance.
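The race can be sketched as an interleaving of two check-then-insert sequences. Everything here (the in-memory "collection", the field names) is invented for illustration:

```python
# In-memory stand-in for a sharded collection with no unique index on
# the 'email' field (the shard key is a different field, so the cluster
# cannot enforce uniqueness on 'email' for us).
collection = []

def exists(email: str) -> bool:
    """Application-level uniqueness check before writing."""
    return any(doc["email"] == email for doc in collection)

def insert(doc: dict) -> None:
    collection.append(doc)

# Two application processes interleave: both check before either writes.
email = "dup@example.com"
check_a = exists(email)   # process A: not found
check_b = exists(email)   # process B: not found
insert({"email": email})  # process A writes
insert({"email": email})  # process B writes -> duplicate slips in
```

Both checks succeed because neither write has happened yet, which is exactly why check-then-write cannot guarantee uniqueness without a lock or a server-side unique index.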
