A Summary of Distributed ID Schemes at Large Internet Companies

An ID is the unique identifier of a piece of data. The traditional approaches are UUIDs and database auto-increment IDs. Most Internet companies use MySQL, and because they need transaction support they usually use the InnoDB storage engine. A UUID is too long and unordered, so it is not suitable as an InnoDB primary key. An auto-increment ID fits well, but as the business grows and the volume of data increases, tables have to be sharded. After sharding, each table increments at its own pace, so ID conflicts are likely. A separate mechanism is then needed to generate unique IDs. An ID generated this way is called a distributed ID, or a global ID. The sections below analyze the main mechanisms for generating distributed IDs.

This article is a summary rather than a detailed analysis; later articles will cover some of the schemes in detail.

Database auto-increment ID

The first scheme is still based on a database auto-increment ID. It requires a separate database instance, in which we create a dedicated table:

The table structure is as follows:

CREATE DATABASE `SEQID`;

CREATE TABLE SEQID.SEQUENCE_ID (
	id bigint(20) unsigned NOT NULL auto_increment, 
	stub char(10) NOT NULL default '',
	PRIMARY KEY (id),
	UNIQUE KEY stub (stub)
) ENGINE=MyISAM;

The following statements generate and fetch an auto-increment ID:

begin;
replace into SEQUENCE_ID (stub) VALUES ('anyword');
select last_insert_id();
commit;

The stub field has no special meaning. It exists only to make it easy to insert a row, because an auto-increment ID is produced only when a row is inserted. We insert with replace: it first checks whether a row with the same stub value exists; if so, it deletes that row and then inserts, and otherwise it inserts directly.

This mechanism requires a dedicated MySQL instance. It works, but it falls short on performance and reliability: every time a business system needs an ID it must send a request to the database, which is slow, and if the database instance goes down, every business system is affected.

To address the database reliability problem, we can use the second distributed ID scheme.

Multi-master database

If the two databases form a master-slave cluster, the reliability problem is solved under normal circumstances. But if the master goes down before its data has been synchronized to the slave, duplicate IDs appear. Instead we can use a dual-master cluster, in which both MySQL instances produce auto-increment IDs independently. This also improves efficiency, but without further changes the two instances are likely to generate the same ID. Each instance therefore needs its own start value and step size for auto-increment.

Configuration for the first MySQL instance:

set @@auto_increment_offset = 1;     -- start value
set @@auto_increment_increment = 2;  -- step size

Configuration for the second MySQL instance:

set @@auto_increment_offset = 2;     -- start value
set @@auto_increment_increment = 2;  -- step size

With this configuration, the two MySQL instances generate the following ID sequences. mysql1 has a start value of 1 and a step of 2, so it generates 1, 3, 5, 7, 9, ... mysql2 has a start value of 2 and a step of 2, so it generates 2, 4, 6, 8, 10, ...
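
The interleaving of the two sequences can be checked with a small simulation. This is only a sketch in Python: the helper auto_increment_ids is a hypothetical name that mimics how an instance with a given start value and step hands out values, and it is not a MySQL API.

```python
from itertools import count, islice

def auto_increment_ids(offset, step):
    """Yield offset, offset + step, offset + 2 * step, ... (hypothetical helper)."""
    return count(offset, step)

# First five IDs from each simulated instance.
mysql1 = list(islice(auto_increment_ids(1, 2), 5))
mysql2 = list(islice(auto_increment_ids(2, 2), 5))

# IDs that both instances would hand out; empty when the offsets differ.
overlap = set(mysql1) & set(mysql2)

print(mysql1)
print(mysql2)
print(overlap)
```

Because both instances share the same step but start from different offsets, their sequences never meet.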

This scheme needs a dedicated application for generating distributed IDs, such as DistributIdService. It exposes an interface through which business applications obtain IDs. When a business application needs an ID, it calls DistributIdService over RPC, and DistributIdService randomly picks one of the two MySQL instances above to fetch the ID from.

With this scheme, even if one of the MySQL instances goes down, DistributIdService is not affected; it can still use the other instance to generate IDs.

However, this scheme does not scale well. If two MySQL instances are not enough and you need to add more to improve performance, things get troublesome.

Suppose we want to add an instance mysql3. What has to happen? First, the step size of mysql1 and mysql2 must be changed to 3, and this can only be done manually, which takes time. Second, because mysql1 and mysql2 keep incrementing in the meantime, the start value of mysql3 may have to be set fairly high to leave enough time to change the step size of mysql1 and mysql2. Third, duplicate IDs may appear while the step size is being changed, and avoiding that may require downtime.
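
A rough illustration of the second point, as a Python sketch with made-up numbers: assume mysql1 and mysql2 have already issued every ID up to 100. The helper issued and the value 102 are assumptions chosen for this example only.

```python
def issued(offset, step, upto):
    """IDs an instance with this offset and step has handed out up to `upto` (hypothetical helper)."""
    return set(range(offset, upto + 1, step))

# mysql1 and mysql2 have been running with step 2 and have issued IDs up to 100.
already_used = issued(1, 2, 100) | issued(2, 2, 100)

# Naive mysql3: offset 3, step 3, counting from the very beginning.
naive = [3 + 3 * i for i in range(5)]
naive_collisions = [i for i in naive if i in already_used]

# Safer mysql3: same offset and step, but the first value is pushed past every issued ID.
safe_start = 102  # smallest multiple of 3 above 100 in this example
safe = [safe_start + 3 * i for i in range(5)]
safe_collisions = [i for i in safe if i in already_used]

print(naive_collisions)
print(safe_collisions)
```

The high start value only protects mysql3 from IDs that already exist. mysql1 and mysql2 still have to be switched to step 3 before their own sequences reach that range.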

To solve these problems and further improve the performance of DistributIdService, we turn to the third distributed ID mechanism.

Segment mode

We can fetch auto-increment IDs in segments. A segment can be understood as a batch: if DistributIdService fetches many IDs from the database at once and caches them locally, business applications can obtain IDs much more efficiently.

For example, each time DistributIdService goes to the database it fetches a segment such as (0, 1000], which stands for 1000 IDs. When a business application requests an ID, DistributIdService simply increments a local counter and returns the value, without querying the database each time. Only when the local counter reaches 1000 and the current segment is used up does it go back to the database for the next segment.
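
The local-increment logic can be sketched in Python as follows. This is a minimal illustration, not DistributIdService itself: FakeSegmentStore stands in for the database, SegmentIdGenerator is a hypothetical class name, and a segment is represented as (start, end], meaning the IDs start + 1 through end.

```python
import threading

class FakeSegmentStore:
    """Stands in for the database: hands out consecutive segments (hypothetical)."""

    def __init__(self, step):
        self.current_max_id = 0
        self.step = step
        self.fetches = 0  # how many times the "database" was asked

    def next_segment(self):
        start = self.current_max_id
        self.current_max_id += self.step
        self.fetches += 1
        return start, self.current_max_id  # the segment (start, end]

class SegmentIdGenerator:
    """Serves IDs from a locally cached segment and refills it when exhausted."""

    def __init__(self, store):
        self._store = store
        self._lock = threading.Lock()
        self._next = 1
        self._end = 0  # empty segment, so the first call triggers a fetch

    def next_id(self):
        with self._lock:
            if self._next > self._end:
                start, end = self._store.next_segment()
                self._next, self._end = start + 1, end
            value = self._next
            self._next += 1
            return value

store = FakeSegmentStore(step=1000)
gen = SegmentIdGenerator(store)
ids = [gen.next_id() for _ in range(2500)]
print(ids[0], ids[-1], store.fetches)
```

Serving 2500 IDs touches the store only three times, which is where the efficiency gain comes from.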

So, we need to make changes to the database table, as follows:

CREATE TABLE id_generator (
  id int(10) NOT NULL,
  current_max_id bigint(20) NOT NULL COMMENT 'current maximum id',
  increment_step int(10) NOT NULL COMMENT 'segment length',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The table records the step length and the current maximum ID (that is, the last value of the most recently allocated segment). Because the incrementing logic has moved into DistributIdService, the database no longer needs it.

This scheme no longer depends heavily on the database. Even if the database is unavailable, DistributIdService can keep serving for a while. However, if DistributIdService restarts, the unused IDs of its cached segment are lost, leaving gaps in the ID sequence.

To make DistributIdService highly available, it needs to run as a cluster. When a business application requests an ID, one DistributIdService node is chosen at random. All nodes connect to the same database, so several nodes may request a segment at the same time. Optimistic locking is needed to control this, for example by adding a version field to the table and acquiring a segment with the following SQL:

update id_generator set current_max_id=#{newMaxId}, version=version+1 where version = #{version}

Because DistributIdService computes newMaxId as oldMaxId + step, the segment has been acquired as soon as the update above succeeds.
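
The optimistic-locking step can be tried out with a short Python sketch. It makes several assumptions: an in-memory SQLite database stands in for MySQL, the version column is added to the table, the update also filters on the row id, and acquire_segment is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_generator ("
    " id INTEGER PRIMARY KEY,"
    " current_max_id INTEGER NOT NULL,"
    " increment_step INTEGER NOT NULL,"
    " version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO id_generator VALUES (1, 0, 1000, 0)")
conn.commit()

def acquire_segment(conn, row_id=1, max_retries=10):
    """Return the segment (old_max, new_max], retrying if another node won the race."""
    for _ in range(max_retries):
        old_max, step, version = conn.execute(
            "SELECT current_max_id, increment_step, version"
            " FROM id_generator WHERE id = ?",
            (row_id,),
        ).fetchone()
        new_max = old_max + step
        cur = conn.execute(
            "UPDATE id_generator SET current_max_id = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (new_max, row_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:  # the version was unchanged, so the segment is ours
            return old_max, new_max
    raise RuntimeError("could not acquire a segment")

first = acquire_segment(conn)
second = acquire_segment(conn)
print(first, second)
```

An update that carries a stale version matches no row, so the node that lost the race simply reads the row again and retries.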

To make the database layer highly available, it must be deployed in multi-master mode, and each database must generate segments that do not overlap. This reuses the idea from the multi-master scheme: add a start value and a step size to the table above. For example, with two MySQL instances, mysql1 generates the segment (1, 1001] and hands out 1, 3, 5, 7, ... while mysql2 generates the segment (2, 1002] and hands out 2, 4, 6, 8, ...

For more details, see the open-source project TinyId: github.com/didi/tinyid...

TinyId adds one more step to improve efficiency. In the implementation above, the incrementing logic lives in DistributIdService. It can instead be moved into the business application itself, so that the application only needs to fetch a segment and no longer has to call DistributIdService for every increment.

Snowflake algorithm

The three approaches above are all based on auto-increment. Next comes the well-known snowflake algorithm.

We can look at distributed IDs from another angle: it is enough to make sure that each machine responsible for generating IDs produces different IDs within every millisecond.

snowflake is a distributed ID generation algorithm open-sourced by Twitter. Unlike the three mechanisms above, it does not depend on a database.

The core idea is that a distributed ID is a fixed-size long value. A long takes 8 bytes, that is 64 bits, and the original snowflake algorithm allocates the bits as follows:


  • The first part is a single flag bit. In Java the most significant bit of a long is the sign bit (0 for positive, 1 for negative). IDs are generally positive, so it is fixed at 0.
  • The timestamp part takes 41 bits at millisecond precision. Implementations generally store not the current timestamp but the difference (current time - a fixed start time), so that generated IDs start from a smaller value. 41 bits of timestamp last 69 years: (1L << 41) / (1000L * 60 * 60 * 24 * 365) = 69.
  • The worker machine ID takes 10 bits. This part is flexible: for example, the upper 5 bits can identify the data center and the lower 5 bits the machine within it, allowing 1024 nodes to be deployed.
  • The sequence number takes 12 bits, so a single node can generate 4096 IDs within the same millisecond.
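
The bit layout in the list above can be expressed as a pair of helper functions. This is a sketch in Python rather than Java, and the names pack and unpack are hypothetical; it also reproduces the 69-year calculation.

```python
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12

WORKER_SHIFT = SEQUENCE_BITS                   # worker id sits above the sequence
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS  # timestamp sits above both

def pack(delta_ms, worker_id, sequence):
    """Combine the three fields into one 64-bit ID; the sign bit stays 0."""
    if not 0 <= delta_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp difference does not fit in 41 bits")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker id does not fit in 10 bits")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence does not fit in 12 bits")
    return (delta_ms << TIMESTAMP_SHIFT) | (worker_id << WORKER_SHIFT) | sequence

def unpack(uid):
    """Split an ID back into (delta_ms, worker_id, sequence)."""
    sequence = uid & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (uid >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1)
    delta_ms = uid >> TIMESTAMP_SHIFT
    return delta_ms, worker_id, sequence

uid = pack(delta_ms=1, worker_id=3, sequence=5)
years = (1 << TIMESTAMP_BITS) // (1000 * 60 * 60 * 24 * 365)
print(uid, unpack(uid), years)
```

Because the timestamp occupies the highest bits below the sign bit, IDs from one node sort by generation time.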

With this logic, the algorithm only needs to be implemented in Java and packaged as a utility method. Each business application can then call the utility directly to obtain a distributed ID, as long as every application instance has its own worker machine ID. No separate ID-generation application has to be built.
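
A minimal generator along these lines might look like the sketch below. It is written in Python rather than Java for brevity, and several details are assumptions rather than part of the original algorithm description: the class name SnowflakeGenerator, the start time constant (2020-01-01 UTC), the injectable clock, and the choice to raise an error if the clock moves backwards.

```python
import threading
import time

class SnowflakeGenerator:
    """Sketch of a snowflake-style generator: 41-bit time, 10-bit worker, 12-bit sequence."""

    def __init__(self, worker_id, epoch_ms=1_577_836_800_000, clock=None):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._epoch_ms = epoch_ms  # assumed start time constant
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self._epoch_ms) << 22) | (self._worker_id << 12) | self._sequence

gen_a = SnowflakeGenerator(worker_id=1)
gen_b = SnowflakeGenerator(worker_id=2)
ids_a = [gen_a.next_id() for _ in range(5000)]
ids_b = [gen_b.next_id() for _ in range(5000)]
print(len(set(ids_a + ids_b)))
```

Two generators with different worker IDs never collide even when they run in the same millisecond, which is why each instance must have its own worker machine ID.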

The snowflake algorithm is not hard to implement. A Java implementation is available on GitHub: github.com/beyondfengy...

Large companies do not actually use snowflake directly but adapt it, because the hardest part of snowflake in practice is the worker machine ID. The original algorithm requires manually assigning a machine ID to each machine and configuring it somewhere that snowflake can read it from.

In a large company there are a great many machines, so doing this by hand is costly and error-prone. That is why large companies have adapted snowflake.

Baidu (uid-generator)

GitHub address: uid-generator

uid-generator is based on snowflake, but it generates the worker machine ID, which it calls workId, differently.

The workId in uid-generator is generated automatically when the application starts. To account for applications deployed in Docker, uid-generator lets users define their own workId generation strategy; the default strategy assigns it from the database. Put simply, on startup the application inserts a row (made up of host and port) into a database table (uid-generator requires adding a WORKER_NODE table), and the auto-increment ID returned by the successful insert is that machine's workId.

In uid-generator the workId takes 22 bits, the time takes 28 bits, and the sequence takes 13 bits. Note that, unlike the original snowflake, time is measured in seconds rather than milliseconds. The workId also behaves differently: the same application consumes a new workId every time it restarts.
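
What this bit allocation implies can be worked out with a few lines of arithmetic. The Python sketch below only computes capacities from the bit widths stated above and does not use uid-generator itself.

```python
TIME_BITS, WORKER_BITS, SEQ_BITS = 28, 22, 13

# One sign bit plus the three fields should fill a 64-bit long.
total_bits = 1 + TIME_BITS + WORKER_BITS + SEQ_BITS

# Time is counted in seconds, so 28 bits cover 2**28 seconds.
lifetime_years = (1 << TIME_BITS) / (60 * 60 * 24 * 365)

# Number of distinct workIds, i.e. application restarts that can be absorbed.
max_worker_ids = 1 << WORKER_BITS

# Number of IDs one worker can issue within a single second.
ids_per_second = 1 << SEQ_BITS

print(total_bits, round(lifetime_years, 1), max_worker_ids, ids_per_second)
```

With these widths the time field lasts roughly eight and a half years, while the much larger workId space leaves room for a new workId on every restart.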

For details, see github.com/baidu/uid-g...

Meituan (Leaf)

GitHub address: Leaf

Leaf is Meituan's distributed ID generation framework. It is quite comprehensive, supporting both segment mode and snowflake mode. Segment mode is not covered here, since it is similar to the analysis above.

Leaf's snowflake mode differs from the original snowflake algorithm mainly in how the workId is generated. Leaf generates the workId from ZooKeeper sequential IDs: each application using Leaf-snowflake creates a sequential node in ZooKeeper at startup, so each machine corresponds to one sequential node, that is, one workId.

Summary

In general, both of the frameworks above generate the workId automatically, which makes the system more stable and reduces manual work.

Redis

Finally, Redis can also be used to generate distributed IDs. This is similar to using a MySQL auto-increment ID: the Redis incr command atomically increments a value and returns it, for example:

127.0.0.1:6379> set seq_id 1     // initialize the auto-increment ID to 1
OK
127.0.0.1:6379> incr seq_id      // increment by 1 and return the new value
(integer) 2
127.0.0.1:6379> incr seq_id      // increment by 1 and return the new value
(integer) 3

Redis is very efficient, but persistence has to be considered. Redis supports two persistence modes, RDB and AOF.

RDB persistence amounts to taking a snapshot at intervals. If the counter is incremented several times after one snapshot and Redis goes down before the next snapshot is taken, duplicate IDs appear after Redis restarts.
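
The failure mode can be reproduced with a toy model. The Python sketch below does not talk to Redis: SnapshotCounter is a hypothetical in-memory class that only imitates a counter whose state is saved by occasional snapshots.

```python
class SnapshotCounter:
    """Toy in-memory counter with explicit snapshots (not a Redis client)."""

    def __init__(self):
        self.value = 0
        self.snapshot = 0

    def incr(self):
        self.value += 1
        return self.value

    def save_snapshot(self):
        self.snapshot = self.value

    def crash_and_restart(self):
        # Everything written after the last snapshot is lost.
        self.value = self.snapshot

counter = SnapshotCounter()
issued = [counter.incr() for _ in range(3)]   # 1, 2, 3
counter.save_snapshot()
issued += [counter.incr() for _ in range(2)]  # 4, 5 are not in any snapshot
counter.crash_and_restart()
after_restart = [counter.incr() for _ in range(2)]

duplicates = sorted(set(issued) & set(after_restart))
print(duplicates)
```

The IDs issued between the last snapshot and the crash are handed out a second time after the restart.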

AOF persistence amounts to persisting every write command, so if Redis goes down no duplicate IDs appear. However, because there are so many incr commands, recovering the data on restart takes a long time.


Origin juejin.im/post/5d6fc8eff265da03ef7a324b