A detailed look at ID generation methods in distributed systems


1. Origin of the requirement

Almost every business system needs to generate record identifiers, for example:
(1) Message identifier: message-id
(2) Order identifier: order-id
(3) Post identifier: tiezi-id
This record identifier is usually the unique primary key in the database, and a clustered index is built on it, that is, rows are physically stored in the order of this field.

Queries on these records often need paging or sorting, for example:
(1) Pull the latest page of messages: select message-id / order by time / limit 100
(2) Pull the latest page of orders: select order-id / order by time / limit 100
(3) Pull the latest page of posts: select tiezi-id / order by time / limit 100
So there is usually a time field with an ordinary (non-clustered) index on it.

An ordinary index stores pointers to the actual records, so access through it is slower than through the clustered index. If record identifiers are generated roughly in time order, the index lookup on the time field can be skipped:
select message-id / (order by message-id) / limit 100
Again, this only works if the generated message-id values trend upward over time.

This leads to the two core requirements for generating record identifiers (the three XXX-id fields mentioned above):
(1) Globally unique
(2) Trend-ordered, that is, roughly increasing over time
This is the core question of this article: how to efficiently generate a globally unique ID that is trend-ordered.

2. Common methods, their shortcomings, and optimizations

Method 1: Use the auto_increment of the database to generate a globally unique incremental ID

Advantages:
(1) Simple: it uses an existing database feature
(2) Guarantees uniqueness
(3) Guarantees increasing order
(4) Fixed step size
Disadvantages:
(1) Availability is hard to guarantee: the common database architecture is one master with multiple slaves plus read/write separation. Generating an auto-increment ID is a write request, so if the master goes down, ID generation stops.
(2) Poor scalability and a hard performance ceiling: writes go to a single point, so the write performance of the master caps the ID generation rate, and it is hard to scale out.
Improvement methods:
(1) Add more masters to avoid a single write point
(2) Partition the ID space horizontally so that no two masters generate the same ID

As shown in the figure above, one write database becomes three. Each is configured with a different auto_increment initial value and the same step size, so the IDs generated by each database never overlap (in the figure above, database 0 generates 0, 3, 6, 9, ..., database 1 generates 1, 4, 7, 10, ..., and database 2 generates 2, 5, 8, 11, ...).
The improved architecture guarantees availability, but it has these disadvantages:
(1) The "absolute increasing order" of ID generation is lost: if database 0 is accessed first and generates 0 and 3, and then database 1 is accessed and generates 1, the IDs are not strictly increasing over a very short window (this is not a big problem, since our goal is an increasing trend, not strictly increasing order).
(2) The write pressure on the database is still high, because the database must be accessed every time an ID is generated.
To solve these two problems, the second common solution is introduced.
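
The multi-master arrangement from Method 1 (different starting values, same step) can be sketched in a few lines. This is a minimal simulation rather than database code: `auto_increment` and `build_masters` are hypothetical names, and each simulated master is just a counter that starts at its own offset and advances by the number of masters. In MySQL the analogous settings should be the `auto_increment_offset` and `auto_increment_increment` system variables.

```python
from itertools import count, islice


def auto_increment(offset: int, step: int):
    """Yield offset, offset + step, offset + 2*step, ... like one write master."""
    return count(start=offset, step=step)


def build_masters(n: int):
    """Create n simulated masters: master i starts at i, and all share step n."""
    return [auto_increment(offset=i, step=n) for i in range(n)]


if __name__ == "__main__":
    for i, master in enumerate(build_masters(3)):
        # Show the first four IDs each simulated master would hand out.
        print(f"master {i}:", list(islice(master, 4)))
```

Interleaving calls shows why the order is only a trend: taking two IDs from master 0 and then one from master 1 yields 0, 3, 1.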

Method 2: Single-point batch ID generation service

One important reason distributed systems are hard is that "without a global clock, it is hard to guarantee absolute ordering." To guarantee absolute ordering, the only option is a single-point service that uses its local clock. The database write pressure is high because the database is accessed every time an ID is generated; fetching IDs in batches reduces that pressure.

As shown in the figure above, the database uses a dual-master setup to ensure availability and stores only the current maximum ID, for example 0. Suppose the ID generation service pulls 6 IDs per batch: the service accesses the database and updates the current maximum ID to 5. When applications request IDs, the ID generation service no longer needs to access the database each time and can hand out 0, 1, 2, 3, 4, 5 in order. Once these IDs are used up, it updates the maximum ID to 11 and can hand out 6, 7, 8, 9, 10, 11. The pressure on the database drops to 1/6 of the original.
Advantages:
(1) Guarantees strictly increasing ID generation
(2) Greatly reduces the pressure on the database; the service can generate tens or even hundreds of thousands of IDs per second
Disadvantages:
(1) The service is still a single point
(2) If the service crashes and restarts, the IDs may not be contiguous and there will be holes (suppose the service holds 0, 1, 2, 3, 4, 5 in memory, the max-id in the database is 5, and the service restarts after handing out 3. Allocation then resumes from 6, so 4 and 5 become holes, but this is not a big problem.)
(3) Although tens or hundreds of thousands of IDs can be generated per second, there is still a performance ceiling, and the service cannot scale horizontally.
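
The batch allocation in Method 2, including the hole left by a restart, can be sketched as follows. This is a minimal in-memory sketch under stated assumptions: `MaxIdStore` stands in for the database row that holds the maximum ID, `BatchIdService` is a hypothetical name for the single-point service, and `max_id = -1` is used here to mean "nothing allocated yet". A real deployment would reserve the range with one atomic database update.

```python
import threading


class MaxIdStore:
    """Stand-in for the database row that stores only the current max ID."""

    def __init__(self):
        self.max_id = -1  # assumption: -1 means nothing has been allocated yet
        self.writes = 0   # number of simulated database writes

    def reserve(self, batch_size: int):
        """Advance max_id by one batch and return the reserved (first, last) range.

        In a real database this would be a single atomic UPDATE.
        """
        first = self.max_id + 1
        self.max_id += batch_size
        self.writes += 1
        return first, self.max_id


class BatchIdService:
    """Single-point service that hands out IDs from an in-memory batch."""

    def __init__(self, store: MaxIdStore, batch_size: int = 6):
        self._store = store
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._next = 0
        self._last = -1  # empty batch, so the first call fetches from the store

    def next_id(self) -> int:
        with self._lock:
            if self._next > self._last:
                # Batch used up: touch the "database" once for the next range.
                self._next, self._last = self._store.reserve(self._batch_size)
            value = self._next
            self._next += 1
            return value


if __name__ == "__main__":
    store = MaxIdStore()
    service = BatchIdService(store)
    print([service.next_id() for _ in range(4)], "writes:", store.writes)
    restarted = BatchIdService(store)  # simulate a crash and restart
    print("after restart:", restarted.next_id())
```

With a batch size of 6, twelve IDs cost two database writes instead of twelve, and a restart after handing out 3 resumes at 6, leaving 4 and 5 unused.
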
Improvement method:
The usual high-availability optimization for a single-point service is a "backup service", also called a "shadow service", so disadvantage (1) above can be addressed as follows:

As shown in the figure above, the main service handles external requests, and a shadow service is always on standby. When the main service goes down, the shadow service takes over. The switch is transparent to callers and can happen automatically. The usual technique is VIP + keepalived; the details are not covered here.

Method 3: UUID

Although the schemes above greatly increase ID generation performance, a single-point system always has a performance ceiling. In addition, with both solutions, whether a database or a service generates the ID, the business application has to make a remote call, which costs time. Is there a way to generate IDs locally, with high performance and low latency?
UUID is a common solution: string ID = GenUUID();
Advantages:
(1) IDs are generated locally with no remote call, so latency is low
(2) Good scalability: there is essentially no performance ceiling
Disadvantages:
(1) Cannot guarantee an increasing trend
(2) A UUID is long and usually represented as a string, so queries are inefficient when it is indexed as the primary key. Common optimizations are "convert it to two uint64 integers for storage" or "store only half of it" (uniqueness can no longer be guaranteed after halving).
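
The "two uint64 integers" optimization can be illustrated with Python's standard `uuid` module. This is a small sketch, not a storage recommendation: `gen_uuid` mirrors the `GenUUID()` call in the text, while `uuid_to_uint64_pair` and `uint64_pair_to_uuid` are hypothetical helper names, and the choice of a random (version 4) UUID is an assumption.

```python
import uuid

MASK64 = (1 << 64) - 1  # low 64 bits


def gen_uuid() -> str:
    """Generate a random UUID locally, with no remote call."""
    return str(uuid.uuid4())


def uuid_to_uint64_pair(u: uuid.UUID):
    """Split the 128-bit UUID value into (high 64 bits, low 64 bits)."""
    return u.int >> 64, u.int & MASK64


def uint64_pair_to_uuid(high: int, low: int) -> uuid.UUID:
    """Rebuild the UUID from the two stored integers."""
    return uuid.UUID(int=(high << 64) | low)


if __name__ == "__main__":
    u = uuid.UUID(gen_uuid())
    high, low = uuid_to_uint64_pair(u)
    print(u, "->", high, low)
```

Storing both halves is lossless; keeping only one half throws away 64 bits, which is why uniqueness is no longer guaranteed in that case.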

Method 4: Take the current millisecond count

UUID is a local algorithm with high generation performance, but it cannot guarantee an increasing trend, and lookups on a string ID are inefficient. Is there a local algorithm that produces increasing IDs?
Taking the current number of milliseconds is a common solution: uint64 ID = GenTimeMS();
Advantages:
(1) IDs are generated locally with no remote call, so latency is low
(2) The generated IDs trend upward
(3) The generated ID is an integer, so queries are efficient once it is indexed
Disadvantages:
(1) If concurrency exceeds 1,000 IDs per second, duplicate IDs will be generated.
This shortcoming is fatal: the uniqueness of IDs cannot be guaranteed. Using microseconds reduces the probability of conflicts, but at most 1,000,000 IDs can be generated per second; beyond that there will be conflicts, so microseconds do not fundamentally solve the problem.
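
The collision problem is easy to reproduce. In the sketch below, `gen_time_ms` mirrors the `GenTimeMS()` call in the text; the `now_ns` parameter and the `count_duplicates` helper are assumptions added so the clock can be replaced in a test.

```python
import time


def gen_time_ms(now_ns=time.time_ns) -> int:
    """Return the current Unix time in milliseconds as the ID."""
    return now_ns() // 1_000_000


def count_duplicates(ids) -> int:
    """Number of IDs that repeat an earlier ID."""
    return len(ids) - len(set(ids))


if __name__ == "__main__":
    # A tight loop requests IDs far faster than once per millisecond.
    ids = [gen_time_ms() for _ in range(10_000)]
    print("generated:", len(ids), "duplicates:", count_duplicates(ids))
```

Any two requests that land in the same millisecond receive the same ID, so duplicates appear as soon as the request rate outpaces the clock resolution.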

Method 5: Snowflake-like algorithm

Snowflake is Twitter's open-source distributed ID generation algorithm. Its core idea is a long (64-bit integer) ID that uses 41 bits for the number of milliseconds, 10 bits for the machine number, and 12 bits for the serial number within the millisecond. In theory, this algorithm can generate up to 1000*(2^12) IDs per second on a single machine, about 4 million, which fully meets business needs.
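
A generator with this 41/10/12 layout might look like the sketch below. It is written for illustration and is not Twitter's implementation: the class name `SnowflakeLike`, the `clock_ms` parameter, the field order (timestamp, then machine number, then serial number), and the custom epoch of 2016-01-01 UTC are all assumptions.

```python
import threading
import time

WORKER_BITS = 10    # machine number
SEQUENCE_BITS = 12  # serial number within one millisecond
MAX_WORKER = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1
EPOCH_MS = 1_451_606_400_000  # assumed custom epoch: 2016-01-01 00:00:00 UTC


def _system_ms() -> int:
    return time.time_ns() // 1_000_000


class SnowflakeLike:
    def __init__(self, worker_id: int, clock_ms=_system_ms):
        if not 0 <= worker_id <= MAX_WORKER:
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock_ms = clock_ms
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 serial numbers used: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0  # new millisecond, restart the serial number
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


if __name__ == "__main__":
    gen = SnowflakeLike(worker_id=5)
    print([gen.next_id() for _ in range(3)])
```

The per-machine ceiling follows directly from the layout: 1000 milliseconds per second times 2^12 serial numbers per millisecond is 4,096,000 IDs per second.
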
Borrowing the idea of Snowflake and taking each company's business logic and concurrency into account, you can implement your own distributed ID generation algorithm.
For example, suppose a company's requirements for an ID generator service are as follows:
(1) Peak concurrency on a single machine is below 10,000 requests per second and is expected to stay below 100,000 over the next 5 years
(2) There are 2 computer rooms, and the number is expected to stay below 4 over the next 5 years
(3) Each computer room has fewer than 100 machines
(4) Currently 5 business lines need ID generation, and the number is expected to stay below 10
(5) ...
The analysis is as follows:
(1) The high bits hold the number of milliseconds elapsed since January 1, 2016 (assuming the ID generator service goes online after that date). Assuming the system runs for at least 10 years, it needs at least 10 years * 365 days * 24 hours * 3600 seconds * 1000 milliseconds, about 320*10^9 values, so reserve 39 bits for the millisecond count
(2) Peak concurrency on a single machine is below 100,000 per second, that is, below 100 per millisecond on average, so reserve 7 bits for the per-millisecond serial number
(3) Fewer than 4 computer rooms within 5 years: reserve 2 bits for the computer room identifier
(4) Fewer than 100 machines per computer room: reserve 7 bits for the server identifier within each computer room
(5) Fewer than 10 business lines: reserve 4 bits for the business line identifier
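
The bit widths in this analysis can be checked with a few lines of arithmetic. The helper name `bits_needed` and the dictionary keys are hypothetical; the input numbers are the ones from the requirements above.

```python
def bits_needed(distinct_values: int) -> int:
    """Smallest number of bits that can represent `distinct_values` different values."""
    return max(1, (distinct_values - 1).bit_length())


MS_IN_10_YEARS = 10 * 365 * 24 * 3600 * 1000  # 315,360,000,000, roughly 320*10^9

budget = {
    "milliseconds": bits_needed(MS_IN_10_YEARS),
    "serial number per ms": bits_needed(100),
    "computer room": bits_needed(4),
    "server": bits_needed(100),
    "business line": bits_needed(10),
}

if __name__ == "__main__":
    for field, bits in budget.items():
        print(f"{field}: {bits} bits")
    print("total:", sum(budget.values()), "of 64 bits")
```

The five fields add up to 39 + 7 + 2 + 7 + 4 = 59 bits, so the design fits in a 64-bit integer with 5 bits to spare.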

A 64-bit identifier designed this way guarantees that:
(1) IDs generated by different business lines, computer rooms, and machines are different
(2) On the same machine, IDs generated in different milliseconds are different
(3) On the same machine within the same millisecond, the serial number field distinguishes the IDs, so they are different
(4) With the millisecond count in the highest bits, the generated IDs trend upward
Disadvantages:
(1) Because there is "no global clock", the IDs assigned by each individual server are strictly increasing, but globally the generated IDs are only trend-increasing (some servers' clocks run ahead, some run behind).

One last issue that is easy to overlook:
Generated IDs such as message-id / order-id / tiezi-id are often used as the key for modulo-based sharding across databases and tables once the data volume is large. To keep the data evenly distributed after sharding, ID generation often needs to be "random under modulo", so the per-millisecond serial number is usually placed at the end of the ID to keep the low bits of the generated IDs evenly spread.
If the serial number always resets to 0 when the millisecond changes, IDs with serial number 0 become over-represented, and the IDs are unevenly distributed after the modulo. The fix is to reset the serial number not to 0 each time but to a random number from 0 to 9.
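
The skew and the fix can be seen in a small simulation. This is a simplified sketch: `make_id` and `shard_histogram` are hypothetical names, the ID keeps only the millisecond count and a 7-bit serial number (the machine and business fields are left out), traffic is assumed to be one ID per millisecond, and 8 shards are used as an example.

```python
import random
from collections import Counter

SEQUENCE_BITS = 7  # width of the per-millisecond serial number in the example design


def make_id(ms: int, sequence: int) -> int:
    """Simplified ID: millisecond count in the high bits, serial number in the low bits."""
    return (ms << SEQUENCE_BITS) | sequence


def shard_histogram(num_ms: int, shards: int, first_sequence) -> Counter:
    """Generate one ID per millisecond and count how many land on each shard."""
    hist = Counter()
    for ms in range(num_ms):
        hist[make_id(ms, first_sequence()) % shards] += 1
    return hist


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    print("reset to 0:   ", sorted(shard_histogram(1000, 8, lambda: 0).items()))
    print("random 0 to 9:", sorted(shard_histogram(1000, 8, lambda: rng.randint(0, 9)).items()))
```

When the serial number always restarts at 0, every ID lands on shard 0. Starting from a random value spreads the IDs over all shards, although with 8 shards the spread is still not perfectly even, because the values 8 and 9 wrap around to shards 0 and 1.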

Origin blog.csdn.net/datuanyuan/article/details/109058619