Comparison of primary key generation solutions for distributed systems | JD Cloud technical team

UUID

UUID (Universally Unique Identifier) is an unordered string of 32 hexadecimal digits (128 bits), computed by a defined algorithm. To ensure uniqueness, the UUID specification defines the inputs, including the network card MAC address, timestamp, namespace, random or pseudo-random numbers, and clock sequence, as well as the algorithms that generate a UUID from them. Generally speaking, these algorithms make it extremely unlikely that two UUIDs generated anywhere will be the same, but this uniqueness is probabilistic and holds only within practical limits.

A very obvious feature of UUID is that it is relatively long. The format is as follows:

xxxxxxxx-xxxx-Mxxx-xxxx-xxxxxxxxxxxx
467e8542-2275-4163-95d6-7adc205580a9

The M position represents the version number. Since the standard UUID implementation defines 5 versions, M can only be 1, 2, 3, 4, or 5.
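
As a small illustration, the version digit can be read back from a UUID string with the standard java.util.UUID class. This is only a sketch: the class name UuidVersionDemo and the helper versionOf are made up for this example, and the sample value is the one shown in the format above.

```java
import java.util.UUID;

final class UuidVersionDemo {
    // Parses the string and returns the version number stored in the M position.
    static int versionOf(String text) {
        return UUID.fromString(text).version();
    }

    public static void main(String[] args) {
        String sample = "467e8542-2275-4163-95d6-7adc205580a9";
        // Index 14 is the first character of the third group, i.e. the M position.
        System.out.println("M character: " + sample.charAt(14));
        System.out.println("version():   " + versionOf(sample));
    }
}
```

Both lines should report the same digit, because version() simply extracts the four bits that the M character encodes.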

Introduction to each version

The five existing UUID versions are distinguished by usage scenario rather than by precision, so Version 5 is not more precise than Version 1. In terms of uniqueness, every version is designed so that the probability of duplication is close to 0.

Summary:

  1. With UUIDs, anyone can create a value that does not conflict with values created by others, so it can be treated as a unique identifier across space and time.
  2. A UUID can be generated locally on a single machine, so generation is fast and QPS is high. Most languages provide a library that can be called directly.
  3. If you only need a unique ID, use v1 or v4. v1 is based on the timestamp and MAC address, so its IDs follow a recognizable pattern and expose your MAC address. v4 is entirely (pseudo-)random.
  4. If you need the same input to always produce the same UUID, use v3 or v5.

Version1: Implementation based on timestamp and MAC address

A v1 UUID contains a 48-bit MAC address and a 60-bit timestamp. To ensure uniqueness when the time precision is insufficient, v1 extends the timestamp with a 13 to 14-bit clock sequence. For example, when UUIDs are generated faster than the system clock can resolve, the low-order part of the timestamp is incremented by 1 for each additional UUID to simulate a higher-precision timestamp. In other words, when the system clock cannot distinguish the order of two UUIDs, one of them is incremented so that both remain unique, and the probability of duplication is therefore almost 0. The timestamp plus the clock sequence total up to 74 bits (2^{74}, about 1.9 × 10^{22} values). Because the timestamp counts 100-nanosecond intervals (10^{7} per second), each node can generate up to 2^{14} × 10^{7}, about 163 billion, unique UUIDs per second.

However, because the last 12 hexadecimal digits (48 bits) of a v1 UUID are the MAC address of the network card, this version raises privacy and security concerns, and it has been criticized for this.
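
To make the privacy concern concrete, here is a hedged sketch that reads the fields of a v1 UUID back out with java.util.UUID. The sample value is not from this article: it is the well-known DNS namespace UUID from RFC 4122, which happens to be a version 1 UUID. The class name UuidV1Fields and the helper nodeAsMac are invented for illustration.

```java
import java.util.UUID;

final class UuidV1Fields {
    // Formats the 48-bit node field of a version 1 UUID as a MAC-style string.
    // UUID.node() throws UnsupportedOperationException for other versions.
    static String nodeAsMac(UUID uuid) {
        long node = uuid.node();
        StringBuilder sb = new StringBuilder();
        for (int shift = 40; shift >= 0; shift -= 8) {
            if (sb.length() > 0) {
                sb.append(':');
            }
            sb.append(String.format("%02x", (node >>> shift) & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: the RFC 4122 DNS namespace UUID (a version 1 value).
        UUID v1 = UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8");
        System.out.println("version:        " + v1.version());
        System.out.println("timestamp:      " + v1.timestamp() + " (100 ns units)");
        System.out.println("clock sequence: " + v1.clockSequence());
        System.out.println("node (MAC):     " + nodeAsMac(v1));
    }
}
```

Anyone holding a v1 UUID can do the same decoding, which is why the node field is considered a leak when it carries a real MAC address.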

Version2: DCE Security UUID

DCE (Distributed Computing Environment) Security UUIDs use the same algorithm as time-based (v1) UUIDs, but the first 4 bytes of the timestamp are replaced by a POSIX UID or GID. This version is rarely used in practice.

Version3, 5: Based on namespace and name

Both v3 and v5 are generated by hashing the namespace and name. The difference is that v3 uses MD5 and v5 uses SHA-1. Because the algorithm has no random component, a given namespace and name always produce the same UUID. For example:

$ uuid -n 3 -v3 ns:URL www.jd.com
7e963853-8fce-3085-bb2c-8424745d73a2
7e963853-8fce-3085-bb2c-8424745d73a2
7e963853-8fce-3085-bb2c-8424745d73a2

In the implementation, the namespace and the input name are concatenated, the hash is computed, and the result is truncated and formatted (with the version and variant bits set) to produce the UUID.
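
The concatenate-then-hash step can be sketched in Java. The JDK's UUID.nameUUIDFromBytes hashes a byte array with MD5 and sets the version and variant bits, but it does not take a namespace argument, so this sketch prepends the namespace bytes itself. The class name UuidV3Demo and the helper v3 are hypothetical, and the URL namespace constant is the one published in RFC 4122.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class UuidV3Demo {
    // Well-known URL namespace UUID from RFC 4122.
    static final UUID NAMESPACE_URL =
            UUID.fromString("6ba7b811-9dad-11d1-80b4-00c04fd430c8");

    // Concatenates the 16 namespace bytes with the name bytes, then hashes.
    static UUID v3(UUID namespace, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(16 + nameBytes.length);
        buf.putLong(namespace.getMostSignificantBits());
        buf.putLong(namespace.getLeastSignificantBits());
        buf.put(nameBytes);
        // MD5 hash, truncated to 128 bits, with version 3 and variant bits set.
        return UUID.nameUUIDFromBytes(buf.array());
    }

    public static void main(String[] args) {
        // The same namespace and name give the same UUID on every call.
        for (int i = 0; i < 3; i++) {
            System.out.println(v3(NAMESPACE_URL, "www.jd.com"));
        }
    }
}
```

The JDK has no built-in v5 generator; a v5 variant would follow the same shape with SHA-1 and manual bit setting.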

Version4: Based on random numbers

In a v4 UUID, 4 bits represent the version and 2 to 3 bits represent the variant. The remaining 122 to 121 bits are all random, so there are up to 2^{122} (about 5.3 × 10^{36}) possible UUIDs. With a standard UUID library, about 2.71 × 10^{18} UUIDs must be generated before the probability of at least one duplicate reaches 50%. That is equivalent to generating 1 billion UUIDs per second for about 86 years. Stored in a file at 16 bytes per UUID, they would occupy about 43 EB (exabytes), many times larger than the largest databases today, which are on the PB scale.
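
The collision figures can be checked with the usual birthday approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible values and p the target collision probability. The following is a rough sketch using double arithmetic. The class and method names are invented for this example, and it assumes 122 random bits and a 365-day year.

```java
final class UuidBirthdayBound {
    // Approximate number of random values that must be drawn from 2^bits
    // equally likely values before P(at least one collision) reaches p.
    static double countForCollisionProbability(int bits, double p) {
        double space = Math.pow(2.0, bits);
        return Math.sqrt(2.0 * space * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        double n = countForCollisionProbability(122, 0.5);
        double years = n / 1e9 / (365.0 * 24 * 60 * 60); // at 1 billion per second
        double exabytes = n * 16 / 1e18;                 // 16 bytes per UUID
        System.out.printf("UUIDs needed:  %.3e%n", n);
        System.out.printf("years at 1e9/s: %.1f%n", years);
        System.out.printf("storage in EB:  %.1f%n", exabytes);
    }
}
```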

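Using v4 in Java:

// Java 1.5+
import java.util.UUID;

public class UuidV4Demo {
	public static void main(String[] args) {
		for (int i = 0; i < 3; i++) {
			// UUID.randomUUID() returns a random (version 4) UUID
			String uuid = UUID.randomUUID().toString();
			System.out.println(uuid);
		}
	}
}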

The generated UUID is as follows:

8bca474b-214d-4ce8-8446-b99f30147f94
c38588cf-a1c4-4758-9d86-b2ee5552ae59
febf5a46-bd1b-43f8-89a8-d5606e5d1ce0

Since this version is very simple to use, it is the most widely used.

SnowFlake algorithm

Snowflake is Twitter's open-source distributed ID generation algorithm. It combines a timestamp, a machine ID, and a sequence number within the same millisecond to ensure that the generated IDs are unique across a distributed system.

Snowflake algorithm summary

1. Because the timestamp occupies the high-order bits and the auto-incrementing sequence occupies the low-order bits, the IDs as a whole are trend-increasing and ordered.

2. However, because the algorithm relies on the machine clock, duplicate IDs may be generated if the clock is set back. In a distributed environment, the clocks on different machines cannot be perfectly synchronized, so IDs are not always globally increasing.

An ID generated by the SnowFlake algorithm is a 64-bit integer. Its structure is as follows:

  • 1 bit, unused. In binary, a highest bit of 1 denotes a negative number, but generated IDs are normally positive integers, so the highest bit is fixed to 0.
  • 41 bits, used to record the timestamp (milliseconds). 41 bits can represent 2^{41} values. Used for non-negative integers, the range is 0 to 2^{41}-1, so 41 bits can represent up to 2^{41}-1 milliseconds. Converted to years, that is (2^{41} - 1) / (1000*60*60*24*365) ≈ 69 years.
  • 10 bits, used to record the worker machine ID. This allows deployment on 2^{10} = 1024 nodes, and consists of a 5-bit datacenterId and a 5-bit workerId. The largest integer that 5 bits can represent is 2^{5}-1 = 31, so the 32 numbers 0, 1, 2, 3, ..., 31 can be used as different datacenterId or workerId values.
  • 12 bits, the sequence number, used to distinguish IDs generated within the same millisecond. The largest integer that 12 bits can represent is 2^{12}-1 = 4095, so the 4096 numbers 0, 1, 2, 3, ..., 4095 identify the IDs generated by the same machine within the same millisecond.
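
The bit layout above can be sketched as a small generator. This is a minimal illustration under stated assumptions, not Twitter's implementation: the class name SnowflakeSketch, the custom epoch (2020-01-01 UTC), and the choice to throw when the clock moves backwards are all assumptions made for this example.

```java
final class SnowflakeSketch {
    // Hypothetical custom epoch (2020-01-01T00:00:00Z); the article does not specify one.
    static final long EPOCH_MS = 1577836800000L;
    static final int SEQUENCE_BITS = 12;
    static final int WORKER_BITS = 5;
    static final int DATACENTER_BITS = 5;
    static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;      // 4095
    static final long MAX_WORKER = (1L << WORKER_BITS) - 1;          // 31
    static final long MAX_DATACENTER = (1L << DATACENTER_BITS) - 1;  // 31
    static final int WORKER_SHIFT = SEQUENCE_BITS;                          // 12
    static final int DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS;        // 17
    static final int TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS;  // 22

    private final long datacenterId;
    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeSketch(long datacenterId, long workerId) {
        if (datacenterId < 0 || datacenterId > MAX_DATACENTER
                || workerId < 0 || workerId > MAX_WORKER) {
            throw new IllegalArgumentException("datacenterId and workerId must be in 0..31");
        }
        this.datacenterId = datacenterId;
        this.workerId = workerId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock was set back: refuse to generate rather than risk duplicates.
            throw new IllegalStateException(
                    "clock moved backwards by " + (lastTimestamp - now) + " ms");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // All 4096 sequence numbers used in this millisecond: wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    // busy-wait
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH_MS) << TIMESTAMP_SHIFT)
                | (datacenterId << DATACENTER_SHIFT)
                | (workerId << WORKER_SHIFT)
                | sequence;
    }

    // Decoders that reverse the packing, useful for checking the layout.
    static long timestampOf(long id) { return (id >>> TIMESTAMP_SHIFT) + EPOCH_MS; }
    static long datacenterOf(long id) { return (id >>> DATACENTER_SHIFT) & MAX_DATACENTER; }
    static long workerOf(long id) { return (id >>> WORKER_SHIFT) & MAX_WORKER; }
    static long sequenceOf(long id) { return id & MAX_SEQUENCE; }

    public static void main(String[] args) {
        SnowflakeSketch generator = new SnowflakeSketch(3, 7);
        for (int i = 0; i < 3; i++) {
            long id = generator.nextId();
            System.out.println(id + " -> datacenter=" + datacenterOf(id)
                    + " worker=" + workerOf(id) + " sequence=" + sequenceOf(id));
        }
    }
}
```

Throwing on a backwards clock is only one possible policy. Waiting until the clock catches up is another common choice, and which one fits depends on how long a rollback the deployment can tolerate.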

Ordered primary key or random primary key?

Can a random ID generation algorithm such as UUID be used to generate MySQL primary keys? The answer is: No!

As we all know, when a MySQL table uses InnoDB as its storage engine, each index corresponds to a B+ tree. If the table defines a primary key, the index for that primary key is the clustered index (if there is no primary key or suitable unique index, InnoDB automatically generates a hidden auto-incrementing row ID for this purpose), and all of the table's row data is stored in the clustered index. The logical order of the key values in the index determines the physical order of the corresponding rows in the table. It can be understood this way: as long as the index keys are consecutive, the data is also stored close together on the storage medium.

Based on these characteristics, because auto-increment key values are ordered, InnoDB stores each new record directly after the previous one. When the page reaches its maximum fill factor (by default InnoDB fills 15/16 of the page and leaves 1/16 free for future modifications), the next record is written to a new page. Once data is loaded in this order, the primary key pages are filled with nearly sequential records, which keeps the page fill rate close to the maximum and avoids wasting pages. And because a newly inserted row always goes right after the current largest row, MySQL can locate the insert position very quickly, with no extra cost to compute where the new row belongs.

Compared with ordered auto-incrementing IDs, UUID values follow no pattern at all. The primary key of a new row is not necessarily greater than that of the previous row, so InnoDB cannot always append new rows to the end of the index. Instead, it must find a suitable position for each new row and allocate space for it. This requires many extra operations, and the resulting scattered data can lead to the following problems:

  1. The target page may already have been flushed to disk and evicted from the cache, or may never have been loaded into it. InnoDB then has to find the page and read it from disk into memory before inserting, which causes a lot of random I/O;
  2. Because writes are out of order, InnoDB has to split pages frequently to make room for new rows. A page split moves a large amount of data and modifies at least three pages at a time. Moreover, frequent page splits leave pages sparse and irregularly filled, eventually fragmenting the data;
  3. After loading such values into the clustered index (InnoDB's default index type), you sometimes need to run OPTIMIZE TABLE to rebuild the table and optimize page filling, which takes a certain amount of time.
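
The difference in insert pattern can be illustrated with a toy model. The sketch below does not simulate InnoDB pages at all. It only counts how many keys, in arrival order, sort before the largest key seen so far, which means they could not simply be appended at the end of an ordered index. The class and method names are invented for this example, and keys are compared as strings, as a character-typed key column would be.

```java
import java.util.UUID;

final class InsertOrderDemo {
    // Counts keys that sort before the current maximum, i.e. inserts that
    // would land somewhere in the middle of an ordered index.
    static int countNonAppendInserts(String[] keys) {
        int count = 0;
        String max = null;
        for (String key : keys) {
            if (max == null || key.compareTo(max) > 0) {
                max = key;   // new largest key: an append at the end
            } else {
                count++;     // smaller than the maximum: a mid-index insert
            }
        }
        return count;
    }

    // Zero-padded so that string order matches numeric order.
    static String[] sequentialKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = String.format("%019d", (long) i + 1);
        }
        return keys;
    }

    static String[] randomUuidKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = UUID.randomUUID().toString();
        }
        return keys;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("sequential keys, mid-index inserts: "
                + countNonAppendInserts(sequentialKeys(n)));
        System.out.println("random UUID keys, mid-index inserts: "
                + countNonAppendInserts(randomUuidKeys(n)));
    }
}
```

With sequential keys every insert is an append, while with random UUIDs nearly all inserts land in the middle of the key range, which is the access pattern behind the random I/O and page splits described above.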

Author: JD Retail Jinyue

Source: JD Cloud Developer Community. Please indicate the source when reprinting.

Origin blog.csdn.net/jdcdev_/article/details/132977568