Distributed ID Generation: A Summary

First, where the requirement comes from

Almost every business system needs to generate identifiers for its records, for example:

(1) message identifier: message-id

(2) order identifier: order-id

(3) post identifier: tiezi-id

This record identifier is usually the unique primary key in the database, and the database builds a clustered index (cluster index) on it, i.e. rows are physically stored in the order of this field.

 

Queries on these records often need paging or sorting, for example:

(1) pull the latest page of messages: select message-id / order by time / limit 100

(2) pull the latest page of orders: select order-id / order by time / limit 100

(3) pull the latest page of posts: select tiezi-id / order by time / limit 100

So there is usually a time field, with a general index (non-cluster index) built on the time column.

 

As we know, a general index stores pointers to the actual records, so access through it is slower than through the clustered index. If record identifiers are generated roughly in chronological order, the query no longer needs the index on the time field:

select message-id / (order by message-id) / limit 100

Again, this works only on the premise that the generated message-id trends upward with time.
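The equivalence of the two queries can be checked with a small sketch. It uses Python's built-in sqlite3 module purely for illustration; the table name, the column names (message_id and created_at, since a hyphen is not a valid SQL identifier character), and the sample data are all assumptions, not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite an INTEGER PRIMARY KEY is the key the table itself is stored by.
conn.execute(
    "CREATE TABLE message ("
    "message_id INTEGER PRIMARY KEY, created_at INTEGER, body TEXT)"
)

# Hypothetical data: IDs are handed out in the same order as creation time.
rows = [(i, 1_700_000_000 + i, f"msg {i}") for i in range(1, 501)]
conn.executemany("INSERT INTO message VALUES (?, ?, ?)", rows)

# Latest 100 messages, sorted by the time column...
by_time = conn.execute(
    "SELECT message_id FROM message ORDER BY created_at DESC LIMIT 100"
).fetchall()

# ...and the same page, sorted by the primary key only.
by_id = conn.execute(
    "SELECT message_id FROM message ORDER BY message_id DESC LIMIT 100"
).fetchall()

print(by_time == by_id)
```

As long as the IDs increase with time, both queries return the same page, and the second one needs no separate index on the time column.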

 

This leads to the two core requirements for generating record identifiers (the three XXX-id mentioned above):

(1) globally unique

(2) ordered as a trend (increasing over time)

This is also the core question discussed in this article: how to efficiently generate globally unique IDs that are ordered as a trend.

 

Second, common methods, their shortcomings, and optimizations

[Common method one: use the database's auto_increment to generate globally unique, increasing IDs]

Advantages:

(1) simple: it uses an existing database feature

(2) uniqueness is guaranteed

(3) increasing order is guaranteed

(4) the step is fixed

Disadvantages:

(1) availability is hard to guarantee: a common database architecture is one master with multiple slaves plus read/write separation, and ID generation is a write request that goes to the master, so nothing works once the master goes down

(2) poor scalability and a performance ceiling: because writes go through a single point, the write performance of the master database caps the ID generation performance, and it is hard to scale

Improvements:

(1) add more master databases to avoid the single write point

(2) split the data horizontally to ensure that the IDs generated by the different masters never repeat


As the figure above shows, the single write database becomes three write databases. Each is given a different auto_increment initial value and the same step, which ensures that the IDs generated by each database are different (in the figure, database 0 generates 0, 3, 6, 9, ..., database 1 generates 1, 4, 7, 10, ..., and database 2 generates 2, 5, 8, 11, ...).
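The offset-plus-step idea can be simulated in a few lines. This is only a sketch: the function names are made up for illustration, and a real deployment would configure the initial value and step in the database itself rather than in application code.

```python
from itertools import islice

def auto_increment_ids(offset, step):
    """Yield the IDs one write database would produce: offset, offset + step, ..."""
    current = offset
    while True:
        yield current
        current += step

def first_ids(offset, step, count):
    """Return the first `count` IDs of the database with the given offset."""
    return list(islice(auto_increment_ids(offset, step), count))

STEP = 3  # number of write databases; every database uses the same step
for database in range(STEP):
    print(database, first_ids(database, STEP, 4))
```

Because every database starts at a different remainder modulo the step, the three sequences can never overlap.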

The improved architecture guarantees availability, but has these disadvantages:

(1) the generated IDs are no longer "strictly increasing": if database 0 is visited first and generates 0 and 3, and database 1 is visited next and generates 1, then over a very short period the IDs are not strictly increasing (this is not a big problem, since our goal is an increasing trend, not strict increase)

(2) the write pressure on the database is still high, because every generated ID requires a database access

To solve these two problems, we come to the second common method.

 

[Common method two: a single-point service that generates IDs in batches]

One very important reason distributed systems are hard is that "there is no global clock, so ordering is difficult to guarantee". To guarantee ordering, the only option is a single-point service that uses its local clock to guarantee "absolute ordering". The database write pressure comes from accessing the database for every generated ID; fetching IDs in batches reduces it.


As the figure above shows, a dual-master database guarantees availability, and the database stores only the current maximum ID, for example 0. Suppose the ID generation service pulls six IDs per batch: it accesses the database once and changes the current maximum ID to 5. When application services then request IDs, the ID generation service does not need to access the database each time and can hand out 0, 1, 2, 3, 4, 5 in turn. Once these are used up, it changes the maximum ID to 11 and can hand out 6, 7, 8, 9, 10, 11. The pressure on the database thus drops to 1/6 of the original.
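A minimal sketch of this batch allocation follows. The class names are hypothetical, the "database" is just an in-memory object standing in for the row that stores the maximum ID, and its starting value of -1 is an assumption chosen so that the first batch is 0 to 5 as in the example.

```python
import threading

class MaxIdStore:
    """Stand-in for the database row that stores the current maximum ID."""

    def __init__(self):
        self.max_id = -1   # assumption: nothing has been allocated yet
        self.writes = 0    # how many times the "database" was written

    def reserve(self, batch_size):
        """Advance max_id by batch_size and return the reserved range."""
        first = self.max_id + 1
        self.max_id += batch_size
        self.writes += 1
        return first, self.max_id

class BatchIdService:
    """Single-point service that hands out IDs from an in-memory batch."""

    def __init__(self, store, batch_size=6):
        self._store = store
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._next = 0
        self._last = -1    # empty batch: the first call must reserve one

    def next_id(self):
        with self._lock:
            if self._next > self._last:
                # Batch used up: one database write buys batch_size more IDs.
                self._next, self._last = self._store.reserve(self._batch_size)
            value = self._next
            self._next += 1
            return value

store = MaxIdStore()
service = BatchIdService(store)
print([service.next_id() for _ in range(12)], "writes:", store.writes)
```

Twelve IDs cost two database writes instead of twelve. If a new BatchIdService is created on the same store before the old batch is used up, it starts from the next batch, which is how the unused IDs of the old batch are lost after a restart.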

Advantages:

(1) the generated IDs are strictly increasing

(2) the pressure on the database drops sharply; the service can generate tens of thousands to hundreds of thousands of IDs per second

Disadvantages:

(1) the service is still a single point

(2) if the service goes down and is restarted, the IDs it goes on to generate may not be contiguous and holes appear (the service holds 0, 1, 2, 3, 4, 5 in memory and max-id in the database is 5; if the service restarts after handing out 3, the next allocation starts from 6, so 4 and 5 become holes, but this is not a big problem)

(3) although it can generate tens of thousands to hundreds of thousands of IDs per second, there is still a performance ceiling, and it cannot scale horizontally

Improvements:

A common availability optimization for a single-point service is a "standby service", also called a "shadow service", which we can use to address disadvantage (1) above:


As shown above, the main service serves external requests while the shadow service stays on standby; when the main service goes down, the shadow service takes over. The switch is transparent to callers and can be automated. A common technique is vip + keepalived, which is not expanded on here.

 

[Common method three: uuid]

The ID generation schemes above improve performance significantly, but because the system is a single point there is still an overall performance ceiling. Also, in both schemes, whether the database or the service generates the ID, the business application must make a remote call, which takes time. Is there a way to generate IDs locally, with high performance and low latency?

uuid is a common choice: string ID = GenUUID();

Advantages:

(1) IDs are generated locally, with no remote call, so latency is low

(2) it scales well; there is essentially no performance ceiling

Disadvantages:

(1) an increasing trend cannot be guaranteed

(2) a uuid is long and is usually represented as a string, so it is inefficient to index and query as a primary key; common optimizations are "converting it into two uint64 integers for storage" or "storing only half of it" (uniqueness can no longer be guaranteed after halving)
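The "two uint64 integers" optimization can be sketched with Python's standard uuid module. The helper names are made up for illustration, and uuid4 (a random uuid) is just one possible choice for GenUUID.

```python
import uuid

MASK64 = (1 << 64) - 1

def uuid_to_uint64_pair(value):
    """Split a 128-bit uuid into (high, low) unsigned 64-bit integers."""
    return (value.int >> 64) & MASK64, value.int & MASK64

def uint64_pair_to_uuid(high, low):
    """Rebuild the uuid from its two 64-bit halves."""
    return uuid.UUID(int=(high << 64) | low)

new_id = uuid.uuid4()                 # generated locally, no remote call
high, low = uuid_to_uint64_pair(new_id)

print(str(new_id), len(str(new_id)))  # string form is 36 characters long
print(high, low)
```

Keeping both halves is lossless, so uniqueness is preserved; storing only one of the two halves is what gives up the uniqueness guarantee.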

 

[Common method four: take the current number of milliseconds]

uuid is a high-performance local algorithm, but it cannot guarantee an increasing trend, and as a string ID it is inefficient to retrieve. Is there a local algorithm that guarantees an increasing order?

Taking the current number of milliseconds is a common scheme: uint64 ID = GenTimeMS();

Advantages:

(1) IDs are generated locally, with no remote call, so latency is low

(2) the generated IDs trend upward

(3) the generated ID is an integer, so index queries are efficient

Disadvantages:

(1) if concurrency exceeds 1000 per second, duplicate IDs will be generated

This shortcoming is fatal: the uniqueness of the ID cannot be guaranteed. Using microseconds reduces the probability of conflict, but at most one million IDs can be generated per second and anything beyond that will certainly conflict, so using microseconds does not solve the problem fundamentally.
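The collision problem is easy to reproduce. In this sketch, gen_time_ms is a hypothetical Python stand-in for GenTimeMS, and the tight loop plays the role of a burst of concurrent requests.

```python
import time

def gen_time_ms():
    """Use the current wall-clock time in milliseconds as the ID."""
    return time.time_ns() // 1_000_000

# Generate far more than one ID per millisecond.
ids = [gen_time_ms() for _ in range(10_000)]
duplicates = len(ids) - len(set(ids))

print(f"{duplicates} of {len(ids)} IDs are duplicates")
```

Any two calls that land in the same millisecond return the same value, so the number of distinct IDs is bounded by the number of milliseconds elapsed, not by the number of requests.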

 

[Common method five: snowflake-like algorithms]

snowflake is Twitter's open-source distributed ID generation algorithm. Its core idea is a long (64-bit) ID in which 41 bits hold the number of milliseconds, 10 bits hold the machine number, and 12 bits hold the sequence number within the millisecond. In theory a single machine can generate up to 1000 * (2^12) IDs per second, about 4 million (400W), which fully meets business needs.
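The 41/10/12 layout can be sketched as follows. This is not Twitter's code: the class name, the injectable clock, the default epoch of 0, and the choice to raise an error when the clock moves backwards are all assumptions made for illustration.

```python
import threading
import time

MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class SnowflakeLikeGenerator:
    """41 bits of milliseconds | 10 bits machine number | 12 bits sequence."""

    def __init__(self, machine_id, epoch_ms=0, clock=None):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id does not fit in 10 bits")
        self._machine_id = machine_id
        self._epoch_ms = epoch_ms
        # The clock is injectable so the generator can be tested deterministically.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 sequence numbers used: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            elapsed = now - self._epoch_ms
            return (
                (elapsed << (MACHINE_BITS + SEQUENCE_BITS))
                | (self._machine_id << SEQUENCE_BITS)
                | self._sequence
            )

generator = SnowflakeLikeGenerator(machine_id=5)
print(generator.next_id(), generator.next_id())
```

Within one millisecond only the sequence number changes, so consecutive IDs from the same machine differ by one; across milliseconds the timestamp in the high bits keeps the IDs increasing.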

Borrowing the snowflake idea and combining it with your company's business logic and concurrency, you can implement your own distributed ID generation algorithm.

For example, assume a company's ID generator has the following requirements:

(1) peak concurrency per machine is below 10,000 (1W), and is expected to stay below 100,000 (10W) over the next five years

(2) there are two data centers, and fewer than four are expected over the next five years

(3) each data center has fewer than 100 machines

(4) five business lines currently need ID generation, and fewer than 10 business lines are expected in the future

(5) …

The analysis goes as follows:

(1) the high bits hold the number of milliseconds since January 1, 2016 (assuming the ID generation service goes online after that date). Assuming the system runs for at least 10 years, that is at least 10 years * 365 days * 24 hours * 3600 seconds * 1000 milliseconds ≈ 320 * 10^9, so about 39 bits are reserved for the millisecond count

(2) peak concurrency per machine is below 100,000 per second, i.e. below 100 per millisecond, so 7 bits are reserved for the per-millisecond sequence number

(3) there are fewer than 4 data centers within 5 years, so 2 bits are reserved for the data center identifier

(4) each data center has fewer than 100 machines, so 7 bits are reserved for the server identifier within each data center

(5) there are fewer than 10 business lines, so 4 bits are reserved for the business line identifier


With this 64-bit identifier design, we can guarantee that:

(1) IDs generated by each business line, each data center, and each machine are different

(2) on the same machine, IDs generated in different milliseconds are different

(3) on the same machine within the same millisecond, the sequence number distinguishes the IDs, so they are different

(4) putting the milliseconds in the highest bits ensures that the generated IDs trend upward
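The bit budget above can be checked and packed into an integer with a short sketch. The milliseconds go in the highest bits and the sequence number in the lowest, as the text states; the order of the three middle fields and all function names are assumptions for illustration. The five fields add up to 59 bits, so they fit in a 64-bit integer with room to spare.

```python
# (name, width in bits), from most significant to least significant.
LAYOUT = [
    ("millis", 39),
    ("business_line", 4),
    ("datacenter", 2),
    ("machine", 7),
    ("sequence", 7),
]

def bits_needed(distinct_values):
    """Smallest bit width that can hold `distinct_values` different values."""
    return max(1, (distinct_values - 1).bit_length())

def pack(**fields):
    """Combine the named fields into one integer ID."""
    value = 0
    for name, width in LAYOUT:
        field = fields[name]
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name}={field} does not fit in {width} bits")
        value = (value << width) | field
    return value

def unpack(value):
    """Split an integer ID back into its named fields."""
    fields = {}
    for name, width in reversed(LAYOUT):
        fields[name] = value & ((1 << width) - 1)
        value >>= width
    return fields

ten_years_ms = 10 * 365 * 24 * 3600 * 1000
print(ten_years_ms, bits_needed(ten_years_ms))   # milliseconds in 10 years
print(bits_needed(100), bits_needed(4), bits_needed(10))
print(sum(width for _, width in LAYOUT), "bits used")

sample = dict(millis=123456789, business_line=3, datacenter=1, machine=42, sequence=99)
print(unpack(pack(**sample)) == sample)
```

Because the milliseconds occupy the most significant bits, an ID with a later timestamp always compares greater than one with an earlier timestamp, whatever the other fields contain.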

Disadvantages:

(1) because "there is no global clock", the IDs assigned by each individual server are strictly increasing, but viewed globally the generated IDs only trend upward (some servers' clocks run ahead, others run behind)

One last issue that is easy to overlook:

The generated IDs, such as message-id / order-id / tiezi-id, often need to be split across databases and tables once the data volume grows, and the ID is commonly used as the sharding key, taken modulo the number of shards. To spread the data evenly across the databases and tables, the generated IDs need to be "random under modulo", so we usually put the per-millisecond sequence number in the lowest bits of the ID to keep the generated IDs random.

If the sequence number is always reset to 0 whenever we cross into a new millisecond, IDs with sequence number 0 will be disproportionately common, making the generated IDs uneven under modulo. The solution is to reset the sequence number not to 0 every time, but to a random number from 0 to 9.
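The effect of the starting sequence number can be seen in a small simulation. It is a sketch under simplifying assumptions: the ID contains only a millisecond counter and a 7-bit sequence number, traffic is low enough that exactly one ID is generated per millisecond, and there are 4 shards; the function name is hypothetical.

```python
import random
from collections import Counter

SEQUENCE_BITS = 7

def simulate_remainders(shards, milliseconds, start_sequence, seed=0):
    """Count id % shards when one ID is generated in each millisecond."""
    rng = random.Random(seed)
    counts = Counter()
    for ms in range(milliseconds):
        sequence = start_sequence(rng)           # first sequence number of this ms
        new_id = (ms << SEQUENCE_BITS) | sequence
        counts[new_id % shards] += 1
    return counts

always_zero = simulate_remainders(4, 10_000, lambda rng: 0)
randomized = simulate_remainders(4, 10_000, lambda rng: rng.randint(0, 9))

print("reset to 0:     ", dict(always_zero))
print("reset to 0 to 9:", dict(sorted(randomized.items())))
```

With the reset-to-0 rule every ID falls into shard 0, while the random start spreads IDs over all four shards. The spread is not perfectly even in this setup, because the ten values 0 to 9 do not divide evenly among 4 remainders, but it removes the extreme skew.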


Origin blog.csdn.net/weixin_42073629/article/details/104603429