Distributed Globally Unique ID Generation Strategy and Algorithm

Introduction to Distributed Global Unique ID

Generating unique IDs is a problem we often face at the design stage. In a complex distributed system, large volumes of data and messages almost always need to be uniquely identified. Early in the design we should estimate how much data the system will eventually hold: if the data may later be split across multiple databases and tables, a globally unique ID is needed to identify each record. There are many strategies for generating unique IDs, and each has its own applicable scenarios, advantages, and limitations.

Desirable properties of a globally unique ID:

Global uniqueness : Duplicate IDs must never appear. Since the ID is a unique identifier, this is the most basic requirement.
Trend increasing : The MySQL InnoDB engine uses a clustered index, and most RDBMSs store index data in a B-tree structure, so we should prefer ordered primary keys to preserve write performance.
Monotonically increasing : The next ID must be greater than the previous one. This is required for transaction version numbers, incremental IM messages, sorting, and similar cases.
Information security : If IDs are consecutive, malicious users can scrape data very easily by requesting the corresponding URLs in order. For order numbers it is even more dangerous, because competitors can directly infer our daily order volume. In some scenarios IDs therefore need to be irregular and unpredictable.
High availability : Beyond the requirements on the ID itself, the business also demands very high availability from the ID generation system. If that system goes down, the consequences are disastrous, so there must be no single point of failure.
Sharding support : The shard that an ID maps to can be controlled. For example, all articles of one user should be placed in the same shard, so that queries are efficient and modifications are easy.
Moderate length : The ID should not be too long.

1. Use the database's auto_increment to generate IDs

Advantages:

  1. It relies on a built-in database feature, so it is relatively simple
  2. Uniqueness is guaranteed
  3. IDs are guaranteed to increase
  4. The step between IDs is fixed and configurable

Disadvantages:

1. Strong dependence on the DB. Syntax and implementation differ between databases, which causes trouble when migrating databases, supporting multiple database versions, or splitting tables and databases. When the DB fails, the entire system becomes unavailable, which is a fatal problem.

2. Single point of failure. With a single database, read-write separation, or one master with multiple slaves, only the single master can generate IDs, so there is a risk of a single point of failure.

3. Data consistency. Master-slave replication improves availability, but consistency is hard to guarantee in edge cases. An inconsistency during master-slave failover may result in duplicate IDs.

4. Hard to scale. When performance is insufficient it is difficult to scale out, because ID issuance is bounded by the read and write performance of a single MySQL instance.

Possible improvements:

  • Add redundant master databases to avoid writing to a single point
  • Partition the data horizontally and make sure the IDs generated by each master never overlap
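
One common way to keep several masters from issuing the same ID is to give every master the same step and a different starting offset (in MySQL, the auto_increment_increment and auto_increment_offset settings). The sketch below only simulates that arithmetic in memory; the class name SteppedSequence and the choice of three masters are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** In-memory stand-in for one master's auto-increment column (hypothetical helper). */
class SteppedSequence {
    private final long step;
    private long next;

    SteppedSequence(long offset, long step) {
        if (offset < 1 || offset > step) {
            throw new IllegalArgumentException("offset must be in [1, step]");
        }
        this.step = step;
        this.next = offset;
    }

    /** Returns offset, offset + step, offset + 2 * step, ... */
    synchronized long nextId() {
        long id = next;
        next += step;
        return id;
    }

    /** Draws perMaster ids from each of the simulated masters and collects them in one set. */
    static Set<Long> drawAll(int masters, int perMaster) {
        Set<Long> ids = new HashSet<>();
        for (int m = 1; m <= masters; m++) {
            SteppedSequence seq = new SteppedSequence(m, masters);
            for (int i = 0; i < perMaster; i++) {
                ids.add(seq.nextId());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Three masters with step 3: master 1 issues 1, 4, 7, ...; master 2 issues 2, 5, 8, ...
        Set<Long> ids = drawAll(3, 4);
        System.out.println(ids.size() + " distinct ids out of 12 draws");
    }
}
```

Because every master advances by the same step from a different offset, the sequences never intersect, at the cost of having to re-plan the step when masters are added.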

2. Use UUID or GUID

Another common approach is to generate IDs in application code.

The purpose of a UUID (Universally Unique Identifier) is to let every element in a distributed system carry unique identification information without having a central coordinator assign it. Anyone can create UUIDs that do not conflict with those created by anyone else, so there is no need to worry about duplicate names when database records are created.

The standard form of a UUID contains 32 hexadecimal digits, divided by hyphens into five groups in the pattern 8-4-4-4-12, for 36 characters in total. Example: 550e8400-e29b-41d4-a716-446655440000. Five versions (generation methods) are defined. For details, see the UUID specification published by the IETF, A Universally Unique IDentifier (UUID) URN Namespace (RFC 4122).

In Java, we can directly use the following API to generate UUID:

UUID uuid = UUID.randomUUID();
String s = uuid.toString();
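
To make the 8-4-4-4-12 format concrete, the following small sketch inspects what java.util.UUID returns. The class name UuidShape and the sample name "example" are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class UuidShape {
    public static void main(String[] args) {
        UUID random = UUID.randomUUID();
        String text = random.toString();

        System.out.println(text);
        // 32 hexadecimal digits plus 4 hyphens
        System.out.println("length  = " + text.length());
        // randomUUID() produces a version 4 (random) UUID
        System.out.println("version = " + random.version());

        // nameUUIDFromBytes() produces a version 3 (name-based, MD5) UUID,
        // so the same input always yields the same UUID
        UUID named = UUID.nameUUIDFromBytes("example".getBytes(StandardCharsets.UTF_8));
        System.out.println("named version = " + named.version());
    }
}
```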

Advantages

Very simple : IDs are generated locally with a convenient API call.
High performance : Generation is fast and involves no network traffic, so it rarely becomes a bottleneck.
Globally unique : Database migrations, system data merges, and database changes can be handled without ID conflicts.

Disadvantages

High storage cost : A UUID is 16 bytes (128 bits) and is usually represented as a 36-character string, which is too long for many scenarios. For a massive database, storage capacity has to be considered.
Information insecurity : Algorithms that derive the UUID from the MAC address may leak that address. This weakness was once used to track down the creator of the Melissa virus.
Poorly suited as a primary key : UUIDs are often stored as strings, so queries on a UUID primary key are relatively inefficient.
Unordered : UUIDs are not monotonically increasing. Mainstream databases currently use B+ tree indexes for primary keys, and inserting long, unordered keys into such an index is relatively inefficient.
Large transfer size : The long textual form increases the amount of data transmitted.
Unreadable : The value carries no meaning for a human reader.

Solutions

To address the readability problem, the UUID can be converted to an Int64.

To address the ordering problem, NHibernate offers the Comb algorithm (combined GUID/timestamp) among its primary key generators. It keeps 10 bytes of the GUID and uses the other 6 bytes to store the time (DateTime) at which the GUID was generated.
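
The 10 + 6 byte split can be sketched as follows. This is a simplified illustration of the idea, not NHibernate's actual encoding: the class name CombGuid, the use of Unix epoch milliseconds, and the big-endian placement of the timestamp in the last six bytes are all assumptions.

```java
import java.security.SecureRandom;

class CombGuid {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Builds a 16-byte value: 10 random bytes followed by a 6-byte (48-bit) millisecond timestamp. */
    static byte[] comb(long epochMillis) {
        byte[] out = new byte[16];
        byte[] randomPart = new byte[10];
        RANDOM.nextBytes(randomPart);
        System.arraycopy(randomPart, 0, out, 0, 10);
        // Keep the low 48 bits of the timestamp, written big-endian into bytes 10..15.
        for (int i = 0; i < 6; i++) {
            out[15 - i] = (byte) (epochMillis >>> (8 * i));
        }
        return out;
    }

    /** Reads the 48-bit timestamp back out of bytes 10..15. */
    static long timestampOf(byte[] comb) {
        long ts = 0;
        for (int i = 10; i < 16; i++) {
            ts = (ts << 8) | (comb[i] & 0xFF);
        }
        return ts;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        byte[] id = comb(now);
        System.out.println("embedded timestamp matches: " + (timestampOf(id) == now));
    }
}
```

Values generated later carry a larger timestamp part, so a database that sorts on those bytes sees roughly sequential inserts while the random part still protects uniqueness.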

3. Redis generates ID

When a database cannot generate IDs fast enough, we can try Redis instead. Because Redis executes commands on a single thread, its atomic INCR and INCRBY commands can be used to generate globally unique IDs.
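
The sketch below models the semantics of INCR and INCRBY (both return the value after the increment) and shows one possible way to use INCRBY: reserve a block of IDs in a single call and hand them out locally. FakeRedisCounter is an in-memory stand-in and not a Redis client, and the block allocation scheme itself is an assumption rather than something the original text prescribes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Reserves a block of ids with one INCRBY-style call, then serves them locally. */
class BlockIdAllocator {
    private final FakeRedisCounter counter;
    private final long blockSize;
    private long next = 1; // next id to hand out
    private long max = 0;  // last id of the reserved block (0 means nothing reserved yet)

    BlockIdAllocator(FakeRedisCounter counter, long blockSize) {
        this.counter = counter;
        this.blockSize = blockSize;
    }

    synchronized long nextId() {
        if (next > max) {
            // The counter returns the new upper bound of the range reserved for this caller.
            max = counter.incrBy(blockSize);
            next = max - blockSize + 1;
        }
        return next++;
    }

    public static void main(String[] args) {
        FakeRedisCounter shared = new FakeRedisCounter();
        BlockIdAllocator a = new BlockIdAllocator(shared, 3);
        BlockIdAllocator b = new BlockIdAllocator(shared, 3);
        System.out.println(a.nextId() + " " + b.nextId() + " " + a.nextId());
    }
}

/** In-memory stand-in for a single Redis key (assumption: NOT a real Redis client). */
class FakeRedisCounter {
    private final AtomicLong value = new AtomicLong(0);

    long incr() {            // models INCR key
        return value.incrementAndGet();
    }

    long incrBy(long n) {    // models INCRBY key n
        return value.addAndGet(n);
    }
}
```

With block allocation, IDs stay unique across application nodes but are no longer strictly ordered between them, which is the usual trade-off for fewer round trips.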

This approach builds on the Redis EVAL and EVALSHA commands, which execute Lua scripts on the server.

Principle

Using the Lua scripting support in Redis, each node generates unique IDs through a Lua script.

The generated IDs are 64 bits long:

41 bits store the time with millisecond precision, which covers about 69 years.

12 bits store the logical shard ID, so the maximum shard ID is 4095.
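
The layout above can be packed and unpacked with plain shifts and masks, as in the sketch below. The text only specifies the 41-bit time and the 12-bit shard ID; the 10-bit per-millisecond sequence in the low bits, the class name ShardedIdLayout, and the field order are assumptions made so that the example is complete.

```java
class ShardedIdLayout {
    static final int TIME_BITS = 41;
    static final int SHARD_BITS = 12;
    static final int SEQ_BITS = 10; // assumption: not stated in the text

    static final long MAX_TIME = (1L << TIME_BITS) - 1;
    static final long MAX_SHARD = (1L << SHARD_BITS) - 1; // 4095
    static final long MAX_SEQ = (1L << SEQ_BITS) - 1;     // 1023

    /** Packs milliseconds since a custom epoch, a shard id, and a sequence into one long. */
    static long pack(long millisSinceEpoch, long shardId, long seq) {
        if (millisSinceEpoch < 0 || millisSinceEpoch > MAX_TIME
                || shardId < 0 || shardId > MAX_SHARD
                || seq < 0 || seq > MAX_SEQ) {
            throw new IllegalArgumentException("field out of range");
        }
        return (millisSinceEpoch << (SHARD_BITS + SEQ_BITS)) | (shardId << SEQ_BITS) | seq;
    }

    static long millisOf(long id) {
        return id >>> (SHARD_BITS + SEQ_BITS);
    }

    static long shardOf(long id) {
        return (id >>> SEQ_BITS) & MAX_SHARD;
    }

    static long seqOf(long id) {
        return id & MAX_SEQ;
    }

    public static void main(String[] args) {
        long id = pack(1000, 4095, 7);
        System.out.println(id + " -> millis=" + millisOf(id)
                + " shard=" + shardOf(id) + " seq=" + seqOf(id));
    }
}
```

Being able to read the shard ID back out of an ID is what makes this layout useful for routing a lookup to the right shard.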

【Advantages】

  1. It does not depend on the database, is flexible and convenient, and performs better than a database.
  2. Numeric IDs sort naturally, which helps with pagination and with results that need to be sorted.

【Disadvantages】

  1. If the system does not already use Redis, a new component has to be introduced, which increases system complexity.
  2. The coding and configuration workload is relatively large.
  3. A single Redis instance is a single point of failure that affects the availability of the sequence service.

4. ZooKeeper generates ID

ZooKeeper generates sequence numbers mainly through the data version of a znode. It can produce 32-bit and 64-bit data version numbers, which a client can use as unique sequence numbers.

ZooKeeper is rarely used to generate unique IDs. The main reasons are the dependency on ZooKeeper itself and the fact that several API calls are needed for each ID. Under heavy contention a distributed lock also has to be considered, so performance is not ideal in a highly concurrent distributed environment.

5. Twitter’s open source Snowflake algorithm

Snowflake is a distributed ID generation algorithm open-sourced by Twitter. Its core idea is to pack the following fields into a single long (64-bit) ID:

  • 41 bits for the millisecond timestamp: 41 bits last about 69 years
  • 10 bits for the machine number (5 bits for the data center, 5 bits for the machine ID): 10 bits support up to 1024 nodes
  • 12 bits for the sequence number within a millisecond: 12 bits let each node generate 4096 IDs per millisecond

Read from the highest bit down, the layout is: one sign bit that is always 0, the 41-bit millisecond timestamp, the 10-bit machine ID (5 bits for the data center, 5 bits for the machine), and the 12-bit sequence number within the millisecond (so each node can generate 4096 IDs per millisecond). The reference implementation can be found on GitHub.

The Snowflake algorithm can be adapted to the needs of a project. For example, estimate the future number of data centers, the number of machines in each data center, and the peak number of IDs needed per millisecond, then adjust the bit widths accordingly.
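
When adjusting the bit widths, it helps to compute what each width buys. The sketch below does that arithmetic; the class name SnowflakeCapacity and the alternative 39/12/12 split are hypothetical examples, and a year is taken as 365 days.

```java
class SnowflakeCapacity {
    /** Number of distinct nodes addressable with the given number of bits. */
    static long maxNodes(int nodeBits) {
        return 1L << nodeBits;
    }

    /** Number of ids one node can issue within a single millisecond. */
    static long idsPerMilli(int sequenceBits) {
        return 1L << sequenceBits;
    }

    /** Years until a millisecond counter of the given width overflows (365-day years). */
    static double years(int timestampBits) {
        double millis = Math.pow(2, timestampBits);
        return millis / (1000.0 * 60 * 60 * 24 * 365);
    }

    /** A layout fits in a signed long when the three fields use at most 63 bits. */
    static boolean fits(int timestampBits, int nodeBits, int sequenceBits) {
        return timestampBits + nodeBits + sequenceBits <= 63;
    }

    public static void main(String[] args) {
        // Default layout: 41 / 10 / 12
        System.out.printf("years=%.1f nodes=%d ids/ms=%d%n",
                years(41), maxNodes(10), idsPerMilli(12));
        // Hypothetical alternative: fewer years, more nodes
        System.out.printf("years=%.1f nodes=%d ids/ms=%d fits=%b%n",
                years(39), maxNodes(12), idsPerMilli(12), fits(39, 12, 12));
    }
}
```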

Java implementation

public class SnowflakeIdGenerator {

    // Custom epoch in milliseconds (2017-06-28 00:00:00 UTC)
    private final long startTime = 1498608000000L;
    // Number of bits used by the worker ID
    private final long workerIdBits = 5L;
    // Number of bits used by the data center ID
    private final long dataCenterIdBits = 5L;
    // Maximum supported worker ID, 31. (-1L << 5) clears the low 5 bits,
    // and XOR with -1L leaves exactly those 5 bits set.
    private final long maxWorkerId = -1L ^ (-1L << workerIdBits);
    // Maximum supported data center ID, 31
    private final long maxDataCenterId = -1L ^ (-1L << dataCenterIdBits);
    // Number of bits used by the sequence
    private final long sequenceBits = 12L;
    // Left shift for the worker ID: 12 (the width of the sequence)
    private final long workerIdMoveBits = sequenceBits;
    // Left shift for the data center ID: 17 (12 + 5)
    private final long dataCenterIdMoveBits = sequenceBits + workerIdBits;
    // Left shift for the timestamp: 22 (5 + 5 + 12)
    private final long timestampMoveBits = sequenceBits + workerIdBits + dataCenterIdBits;
    // Sequence mask, the largest 12-bit value: 4095 (0b111111111111 = 0xfff)
    private final long sequenceMask = -1L ^ (-1L << sequenceBits);
    //=================================================Worker Parameters================================================
    /**
     * Worker ID (0~31)
     */
    private long workerId;
    /**
     * Data center ID (0~31)
     */
    private long dataCenterId;
    /**
     * Sequence within the current millisecond (0~4095)
     */
    private long sequence = 0L;
    /**
     * Timestamp of the last generated ID
     */
    private long lastTimestamp = -1L;
    //===============================================Constructors=======================================================
    /**
     * Constructor
     *
     * @param workerId     worker ID (0~31)
     * @param dataCenterId data center ID (0~31)
     */
    public SnowflakeIdGenerator(long workerId, long dataCenterId) {
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("Worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (dataCenterId > maxDataCenterId || dataCenterId < 0) {
            throw new IllegalArgumentException(String.format("DataCenter Id can't be greater than %d or less than 0", maxDataCenterId));
        }
        this.workerId = workerId;
        this.dataCenterId = dataCenterId;
    }
    // ==================================================Methods========================================================
    // Thread-safe method that returns the next ID
    public synchronized long nextId() {
        long timestamp = currentTime();
        // If the current time is earlier than the last timestamp, the system clock
        // has moved backwards, so refuse to generate an ID
        if (timestamp < lastTimestamp) {
            throw new RuntimeException(
                    String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }
        // Same millisecond as the previous ID: advance the sequence
        if (lastTimestamp == timestamp) {
            sequence = (sequence + 1) & sequenceMask;
            // Sequence overflow within this millisecond (it wrapped past 4095)
            if (sequence == 0) {
                // Block until the next millisecond and take the new timestamp
                timestamp = blockTillNextMillis(lastTimestamp);
            }
        }
        // Timestamp changed: reset the sequence
        else {
            sequence = 0L;
        }
        // Remember the timestamp of this ID
        lastTimestamp = timestamp;
        // Shift each field into place and OR them together into the 64-bit ID
        return ((timestamp - startTime) << timestampMoveBits)
                | (dataCenterId << dataCenterIdMoveBits)
                | (workerId << workerIdMoveBits)
                | sequence;
    }
    // Spin until the next millisecond, that is, until a newer timestamp is obtained
    protected long blockTillNextMillis(long lastTimestamp) {
        long timestamp = currentTime();
        while (timestamp <= lastTimestamp) {
            timestamp = currentTime();
        }
        return timestamp;
    }
    // Current time in milliseconds
    protected long currentTime() {
        return System.currentTimeMillis();
    }
    //====================================================Test Case=====================================================
    public static void main(String[] args) {
        SnowflakeIdGenerator idWorker = new SnowflakeIdGenerator(0, 0);
        for (int i = 0; i < 1000; i++) {
            long id = idWorker.nextId();
            System.out.println(Long.toBinaryString(id));
            System.out.println(id);
        }
    }
}
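
Since every field sits at a fixed position, an ID can also be taken apart again, which is handy for debugging. The following decoder is a sketch that assumes the same epoch and bit widths as the generator above; the class name SnowflakeIdDecoder is hypothetical and is not part of the original code.

```java
class SnowflakeIdDecoder {
    // These constants must match the generator that produced the id (assumed values).
    static final long START_TIME = 1498608000000L;
    static final int SEQUENCE_BITS = 12;
    static final int WORKER_BITS = 5;
    static final int DATA_CENTER_BITS = 5;

    static final int WORKER_SHIFT = SEQUENCE_BITS;                      // 12
    static final int DATA_CENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS;   // 17
    static final int TIMESTAMP_SHIFT = DATA_CENTER_SHIFT + DATA_CENTER_BITS; // 22

    /** Wall-clock time in epoch milliseconds at which the id was generated. */
    static long timestampOf(long id) {
        return (id >>> TIMESTAMP_SHIFT) + START_TIME;
    }

    static long dataCenterOf(long id) {
        return (id >>> DATA_CENTER_SHIFT) & ((1L << DATA_CENTER_BITS) - 1);
    }

    static long workerOf(long id) {
        return (id >>> WORKER_SHIFT) & ((1L << WORKER_BITS) - 1);
    }

    static long sequenceOf(long id) {
        return id & ((1L << SEQUENCE_BITS) - 1);
    }

    public static void main(String[] args) {
        // Build a sample id by hand: data center 3, worker 5, sequence 9.
        long ts = 1630000000000L;
        long id = ((ts - START_TIME) << TIMESTAMP_SHIFT)
                | (3L << DATA_CENTER_SHIFT) | (5L << WORKER_SHIFT) | 9L;
        System.out.println(timestampOf(id) + " " + dataCenterOf(id)
                + " " + workerOf(id) + " " + sequenceOf(id));
    }
}
```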

Advantages

It is highly stable: it does not depend on third-party systems such as databases and can be deployed as a standalone service. ID generation performance is also very high.
It is flexible and convenient, and bits can be allocated to fit the characteristics of the business.
On a single machine the IDs increase monotonically, because the millisecond timestamp occupies the high bits and the auto-increment sequence occupies the low bits. Across the whole cluster the IDs are trend increasing.

Disadvantages

It depends strongly on the machine clock. If the clock on a machine is set back, IDs may be duplicated or the service may become unavailable.

IDs may not increase globally. They increase on a single machine, but clocks on different machines cannot be perfectly synchronized, so an ID generated later on one machine can be smaller than an ID generated earlier on another.
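
One possible way to soften the clock dependency is to wait out small rollbacks and only fail on large ones. The sketch below illustrates that idea under stated assumptions: the class name RollbackTolerantClock, the injected LongSupplier clock, and the tolerance threshold are all hypothetical and are not part of the Snowflake reference implementation.

```java
import java.util.function.LongSupplier;

class RollbackTolerantClock {
    private final LongSupplier clock;
    private final long maxWaitMillis;
    private long last = -1L;

    RollbackTolerantClock(LongSupplier clock, long maxWaitMillis) {
        this.clock = clock;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Returns a timestamp that is never smaller than the previous one returned. */
    synchronized long next() {
        long now = clock.getAsLong();
        if (now < last) {
            long drift = last - now;
            if (drift > maxWaitMillis) {
                // Large rollback: refuse, as the generator above does.
                throw new IllegalStateException("Clock moved backwards by " + drift + " ms");
            }
            // Small rollback: spin until the clock catches up with the last timestamp.
            while (now < last) {
                now = clock.getAsLong();
            }
        }
        last = now;
        return now;
    }

    public static void main(String[] args) {
        RollbackTolerantClock c = new RollbackTolerantClock(System::currentTimeMillis, 5);
        System.out.println(c.next());
    }
}
```

A generator that takes its timestamps from such a wrapper must keep advancing its sequence when the same millisecond is returned again, which the nextId method above already does for equal timestamps.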

Origin blog.csdn.net/CharlesYooSky/article/details/120075054