[Spring Cloud Series] Snowflake algorithm principle and implementation

1. Overview

In a distributed, high-concurrency environment, a familiar example is booking train tickets on 12306 during the holidays. When a large number of users rush to buy tickets for the same route, tens of thousands of orders may be generated within milliseconds. Guaranteeing that every generated order ID is unique is therefore crucial. In such a flash-sale scenario, the IDs must not only be unique but must also preserve the order in which they were generated.

2. Hard requirements for ID generation rules

Globally unique: Duplicate IDs must never appear. Since the ID is a unique identifier, this is the most basic requirement.
Trend-increasing: MySQL's InnoDB engine uses a clustered index, and most RDBMSs store index data in a B+Tree, so ordered primary keys should be preferred to keep write performance high.
Monotonically increasing: Each new ID must be greater than the previous one, which matters for special requirements such as transaction version numbers and sorting.
Information security: If IDs are consecutive, malicious users can crawl data very easily by simply requesting URLs in sequence; if the ID is an order number, this is even more dangerous.
Contains timestamp: The generated ID embeds complete timestamp information.

3. Availability requirements for the ID generation system

High availability: When a request for a distributed ID is sent, the server should return a unique ID in 99.9999% of cases.
Low latency: The server must answer a request for a distributed ID very quickly.
High QPS: If 100,000 distributed IDs are requested at once, the server must withstand the load and successfully create all 100,000 IDs.

4. Common solutions for distributed IDs

4.1 UUID

The standard form of a UUID (Universally Unique Identifier) contains 32 hexadecimal digits divided into five hyphen-separated groups in an 8-4-4-4-12 pattern, 36 characters in total, for example: 1E785B2B-111C-752A-997B-3346E7495CE2. UUID generation is very fast: it happens locally and does not depend on the network.

UUID disadvantages:

Unordered: The generation order cannot be predicted, so UUIDs cannot provide increasing numbers. MySQL officially recommends keeping the primary key as short as possible; a UUID is a 36-character string (32 hexadecimal digits plus hyphens), so it is not recommended as a primary key.
Index: UUIDs cause B+Tree index splits.

When the distributed ID is used as the primary key, it is also the clustered index, which MySQL implements as a B+Tree. Every insert has to modify the underlying B+Tree. Because UUIDs are unordered, new rows land at random positions in the primary key order, which causes intermediate nodes to split and leaves a large number of nodes only partially filled. This greatly reduces database insert performance.
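
To make the points above concrete, the following minimal Java sketch generates a few random UUIDs with the standard java.util.UUID class. The class name UuidDemo is only illustrative. It prints the 36-character textual form; running it a few times shows that successive values carry no generation order.

```java
import java.util.UUID;

// Illustrative demo class (the name is an assumption, not part of the article's code).
final class UuidDemo {

    // Returns the canonical textual form of a random (version 4) UUID.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            String id = newId();
            // 36 characters in total: 32 hexadecimal digits plus 4 hyphens.
            System.out.println(id + " length=" + id.length()
                    + " hexDigits=" + id.replace("-", "").length());
        }
    }
}
```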

4.2 Database auto-increment primary key

Standalone

The database auto-increment ID mechanism relies mainly on the REPLACE INTO statement of MySQL.

REPLACE INTO inserts a record; if the new row conflicts with an existing value in a unique index, the old row is deleted and the new row takes its place.

Auto-increment IDs work well for a single application, but they are not suitable for clustered, distributed applications:

Horizontal scaling is difficult. Once the growth step and the number of machines have been defined, adding servers requires resetting the initial values on the existing machines. This is hard to operate, so the horizontal scaling solution is complex and difficult to implement.
The database is under heavy pressure. Every ID request reads and writes the database, which hurts performance badly and does not meet the low-latency, high-QPS requirements for distributed IDs, especially under high concurrency.
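
The scaling problem can be illustrated with a small in-memory simulation. This is only a sketch: StepIdAllocator is a hypothetical class that mimics what a database node configured with an initial offset and a growth step would hand out; no real database is involved.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for one database node whose auto-increment is
// configured as: first value = offset, growth step = number of nodes.
final class StepIdAllocator {
    private final long step;
    private long next;

    StepIdAllocator(long offset, long step) {
        this.next = offset;
        this.step = step;
    }

    synchronized long nextId() {
        long id = next;
        next += step;
        return id;
    }

    public static void main(String[] args) {
        // Two nodes: node A yields 1, 3, 5, ... and node B yields 2, 4, 6, ...
        StepIdAllocator a = new StepIdAllocator(1, 2);
        StepIdAllocator b = new StepIdAllocator(2, 2);
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < 5; i++) {
            seen.add(a.nextId());
            seen.add(b.nextId());
        }
        System.out.println(seen.size()); // 10 distinct IDs, no collisions

        // A third node added without reconfiguring A and B (offset 3, step 3)
        // yields 3, 6, 9, ..., which overlaps values that were already issued.
        StepIdAllocator c = new StepIdAllocator(3, 3);
        System.out.println(seen.contains(c.nextId())); // true: 3 was issued by node A
    }
}
```

The second part of main shows why every existing node has to be reconfigured when the cluster grows: the step must equal the new node count on all machines at once.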

4.3 Generate global id strategy based on Redis

In a Redis cluster, each node needs a different growth step, just as with MySQL, and the key must have an expiration time. A Redis cluster can be used to obtain higher throughput.

5. SnowFlake (snowflake algorithm)

Twitter's SnowFlake addresses these requirements. Twitter originally migrated its storage system from MySQL to Cassandra (an open-source distributed NoSQL database originally developed at Facebook). Because Cassandra has no sequential ID generation mechanism, Twitter developed this globally unique ID generation service. SnowFlake is reported to generate about 260,000 increasing, sortable IDs per second.

5.1 SnowFlake Features

IDs generated by Twitter's SnowFlake are ordered by generation time.
The ID produced by the SnowFlake algorithm is a 64-bit integer, which fits a Long (at most 19 characters when converted to a string).
No ID collisions occur within the distributed system (nodes are distinguished by dataCenterId and workerId), and generation is highly efficient.

5.2 SnowFlake structure

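[Figure: SnowFlake ID layout: 1 sign bit | 41-bit timestamp | 10-bit machine code (5-bit DataCenterId + 5-bit WorkerId) | 12-bit sequence number]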

5.3 Principle of Snowflake Algorithm

The snowflake algorithm generates a unique 64-bit ID of type long:

The highest bit is fixed at 0, because the generated ID must be a positive integer; a 1 in that position would make the value negative.
The next 41 bits store a millisecond timestamp: 2^41 / (1000 * 60 * 60 * 24 * 365) ≈ 69, so the field lasts for approximately 69 years.
The next 10 bits store the machine code, made up of a 5-bit DataCenterId and a 5-bit WorkerId. Up to 2^10 = 1024 machines can be deployed.
The last 12 bits store the sequence number. IDs generated within the same millisecond are distinguished by this incrementing sequence number, so one machine can generate 2^12 = 4096 unique IDs per millisecond.
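
The capacity figures above follow directly from the bit widths. A short Java sketch of the arithmetic (the class name SnowflakeCapacity is illustrative):

```java
// Illustrative helper that derives the capacity of each field from its bit width.
final class SnowflakeCapacity {
    static final int TIMESTAMP_BITS = 41;
    static final int MACHINE_BITS = 10;
    static final int SEQUENCE_BITS = 12;

    // Whole 365-day years covered by a millisecond counter of the given width.
    static long yearsCovered(int timestampBits) {
        long millisPerYear = 1000L * 60 * 60 * 24 * 365;
        return (1L << timestampBits) / millisPerYear;
    }

    public static void main(String[] args) {
        System.out.println(1 + TIMESTAMP_BITS + MACHINE_BITS + SEQUENCE_BITS); // 64 bits in total
        System.out.println(yearsCovered(TIMESTAMP_BITS));                      // 69
        System.out.println(1L << MACHINE_BITS);                                // 1024 machines
        System.out.println(1L << SEQUENCE_BITS);                               // 4096 IDs per millisecond
        System.out.println(Long.toString(Long.MAX_VALUE).length());            // 19 decimal digits at most
    }
}
```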

The Snowflake algorithm can be deployed as a separate service; systems that need a globally unique ID then simply request one from that service.

Each Snowflake service instance must first be assigned a 10-bit machine code. This can be set according to your own business, for example computer room number + machine number, machine number + service number, or any other 10-bit integer that distinguishes the instances.
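
One possible way to build the 10-bit machine code is to pack a 5-bit computer room (data center) number and a 5-bit machine number into a single value, as sketched below. The class and method names are hypothetical.

```java
// Hypothetical helper: packs a 5-bit data center number and a 5-bit machine number
// into one 10-bit machine code, and unpacks it again.
final class MachineCode {
    static final int WORKER_BITS = 5;
    static final long MAX_FIELD = (1L << WORKER_BITS) - 1; // 31

    static long pack(long dataCenterId, long workerId) {
        if (dataCenterId < 0 || dataCenterId > MAX_FIELD || workerId < 0 || workerId > MAX_FIELD) {
            throw new IllegalArgumentException("each part must be between 0 and " + MAX_FIELD);
        }
        return (dataCenterId << WORKER_BITS) | workerId;
    }

    static long dataCenterOf(long machineCode) {
        return machineCode >> WORKER_BITS;
    }

    static long workerOf(long machineCode) {
        return machineCode & MAX_FIELD;
    }

    public static void main(String[] args) {
        long code = pack(3, 7);
        System.out.println(code);               // 103 = (3 << 5) | 7
        System.out.println(dataCenterOf(code)); // 3
        System.out.println(workerOf(code));     // 7
        System.out.println(pack(31, 31));       // 1023, the largest 10-bit value
    }
}
```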

5.4 Algorithm implementation

package com.goyeer;

import java.util.Date;

/**
 * @ClassName: SnowFlakeUtil
 * @Author: goyeer
 * @Date: 2023/09/09 19:34
 * @Description: Snowflake-style generator for 64-bit unique IDs
 */
public class SnowFlakeUtil {

    // Shared default instance used by the static helper methods
    private static SnowFlakeUtil snowFlakeUtil;
    static {
        snowFlakeUtil = new SnowFlakeUtil();
    }

    // Initial timestamp (epoch); the time the snowflake service went online can be used
    private static final long INIT_EPOCH = 1694263918335L;

    // Mask used with & to extract the timestamp bits (everything above the low 22 bits)
    private static final long TIME_BIT = 0b1111111111111111111111111111111111111111110000000000000000000000L;

    // Last millisecond timestamp used; needed to tell whether we are still in the same
    // millisecond and to detect the server clock being set back
    private long lastTimeMillis = -1L;

    // Number of bits occupied by dataCenterId
    private static final long DATA_CENTER_ID_BITS = 5L;

    // dataCenterId occupies 5 bits, so its maximum value is 31
    // 0000000000000000000000000000000000000000000000000000000000011111
    private static final long MAX_DATA_CENTER_ID = ~(-1L << DATA_CENTER_ID_BITS);

    // dataCenterId
    private long dataCenterId;

    // Number of bits occupied by workerId
    private static final long WORKER_ID_BITS = 5L;

    // workerId occupies 5 bits, so its maximum value is 31
    // 0000000000000000000000000000000000000000000000000000000000011111
    private static final long MAX_WORKER_ID = ~(-1L << WORKER_ID_BITS);

    // workerId
    private long workerId;

    // The last 12 bits hold the sequence number; the maximum per millisecond is 2^12 - 1 = 4095
    private static final long SEQUENCE_BITS = 12L;

    // Mask (lowest 12 bits are 1, all higher bits are 0), ANDed with the incremented sequence;
    // a result of 0 means the incremented sequence exceeded 4095
    // 0000000000000000000000000000000000000000000000000000111111111111
    private static final long SEQUENCE_MASK = ~(-1L << SEQUENCE_BITS);

    // Latest sequence number within the current millisecond, at most 2^12 - 1 = 4095
    private long sequence;

    // Left shift for workerId: 12
    private static final long WORK_ID_SHIFT = SEQUENCE_BITS;

    // Left shift for dataCenterId: 12 + 5
    private static final long DATA_CENTER_ID_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS;

    // Left shift for the timestamp: 12 + 5 + 5
    private static final long TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS + DATA_CENTER_ID_BITS;

    /**
     * No-argument constructor; uses dataCenterId 1 and workerId 1.
     */
    public SnowFlakeUtil() {
        this(1, 1);
    }

    /**
     * Constructor with explicit node identifiers.
     * @param dataCenterId data center ID, from 0 to 31
     * @param workerId worker ID, from 0 to 31
     */
    public SnowFlakeUtil(long dataCenterId, long workerId) {
        // Validate dataCenterId
        if (dataCenterId < 0 || dataCenterId > MAX_DATA_CENTER_ID) {
            throw new IllegalArgumentException(
                    String.format("dataCenterId must be between 0 and %d", MAX_DATA_CENTER_ID));
        }
        // Validate workerId
        if (workerId < 0 || workerId > MAX_WORKER_ID) {
            throw new IllegalArgumentException(
                    String.format("workerId must be between 0 and %d", MAX_WORKER_ID));
        }
        this.workerId = workerId;
        this.dataCenterId = dataCenterId;
    }

    /**
     * Gets a unique ID from the shared default instance.
     * @return unique ID
     */
    public static Long getSnowFlakeId() {
        return snowFlakeUtil.nextId();
    }

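    /**
     * Generates the next ID with the snowflake algorithm. Note that the method is
     * synchronized so that concurrent callers cannot receive the same ID.
     * @return unique ID
     */
    public synchronized long nextId() {
        long currentTimeMillis = System.currentTimeMillis();
        // The current time is earlier than the time used for the previous ID:
        // the server clock may have been set back
        if (currentTimeMillis < lastTimeMillis) {
            throw new RuntimeException(
                    String.format("The server clock may have been set back, please check the server time. "
                                    + "Current timestamp: %d, last used timestamp: %d",
                            currentTimeMillis, lastTimeMillis));
        }
        if (currentTimeMillis == lastTimeMillis) {
            // Still within the same millisecond: increment the sequence (maximum 4095).
            // After ANDing with the mask (lowest 12 bits are 1), a result of 0 means the
            // sequence overflowed past 4095, so wait for a new timestamp
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                currentTimeMillis = getNextMillis(lastTimeMillis);
            }
        } else {
            // A new millisecond: the sequence starts again from 0
            sequence = 0;
        }
        // Record the last millisecond timestamp used
        lastTimeMillis = currentTimeMillis;
        // Core of the algorithm: shift each part to its position, then OR the parts together
        // <<: left shift; 1 << 2 multiplies binary 1 by 2^2
        // |: bitwise OR; a result bit is 1 if that bit is 1 in either operand
        // Precedence: << binds more tightly than |
        return
                // Timestamp part
                ((currentTimeMillis - INIT_EPOCH) << TIMESTAMP_SHIFT)
                        // Data center part
                        | (dataCenterId << DATA_CENTER_ID_SHIFT)
                        // Worker (machine) part
                        | (workerId << WORK_ID_SHIFT)
                        // Sequence part
                        | sequence;
    }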

    /**
     * Waits for the first timestamp after the given one, i.e. the next millisecond.
     * @param lastTimeMillis the millisecond timestamp to move past
     * @return a timestamp greater than lastTimeMillis
     */
    private long getNextMillis(long lastTimeMillis) {
        long currentTimeMillis = System.currentTimeMillis();
        while (currentTimeMillis <= lastTimeMillis) {
            currentTimeMillis = System.currentTimeMillis();
        }
        return currentTimeMillis;
    }

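    /**
     * Returns a newly generated ID as a decimal string (at most 19 characters).
     * Despite the method name, the value is not random.
     * @return the ID as a string
     */
    public static String getRandomStr() {
        return Long.toString(getSnowFlakeId());
    }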

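    /**
     * Extracts the generation time from an ID.
     * @param id an ID generated by this class
     * @return the time at which the ID was generated
     */
    public static Date getTimeBySnowFlakeId(long id) {
        // Mask out the timestamp bits, shift them back down, then add the epoch again
        return new Date(((TIME_BIT & id) >> TIMESTAMP_SHIFT) + INIT_EPOCH);
    }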

    public static void main(String[] args) {
        SnowFlakeUtil snowFlakeUtil = new SnowFlakeUtil();
        long id = snowFlakeUtil.nextId();

        System.out.println(id);
        Date date = SnowFlakeUtil.getTimeBySnowFlakeId(id);
        System.out.println(date);
        long time = date.getTime();
        System.out.println(time);
        System.out.println(getRandomStr());
    }

}
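
Because every field sits at a fixed bit position, an ID can be taken apart again. The sketch below assumes the same layout and INIT_EPOCH value as the class above (the low 22 bits hold the machine code and the sequence); SnowflakeDecoder and its method names are illustrative additions, not part of the original class.

```java
// Illustrative decoder for IDs laid out as:
// 41-bit timestamp | 5-bit dataCenterId | 5-bit workerId | 12-bit sequence.
final class SnowflakeDecoder {
    static final long INIT_EPOCH = 1694263918335L; // same epoch as SnowFlakeUtil above

    // Builds an ID the same way nextId() combines its parts.
    static long compose(long timeMillis, long dataCenterId, long workerId, long sequence) {
        return ((timeMillis - INIT_EPOCH) << 22)
                | (dataCenterId << 17)
                | (workerId << 12)
                | sequence;
    }

    static long timeMillisOf(long id) { return (id >> 22) + INIT_EPOCH; }
    static long dataCenterOf(long id) { return (id >> 17) & 0x1F; }
    static long workerOf(long id)     { return (id >> 12) & 0x1F; }
    static long sequenceOf(long id)   { return id & 0xFFF; }

    public static void main(String[] args) {
        // One second after the epoch, data center 1, worker 1, sequence 5.
        long id = compose(INIT_EPOCH + 1000, 1, 1, 5);
        System.out.println(id);                            // 4194439173
        System.out.println(timeMillisOf(id) - INIT_EPOCH); // 1000
        System.out.println(dataCenterOf(id));              // 1
        System.out.println(workerOf(id));                  // 1
        System.out.println(sequenceOf(id));                // 5
    }
}
```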

5.5 Advantages of the Snowflake algorithm

Generates unique IDs in a high-concurrency distributed environment; in theory a single node can produce up to 4,096 IDs per millisecond, which is about 4 million per second.
Because IDs are built from the timestamp plus a sequence number that increments within the same timestamp, they are basically guaranteed to increase in order.
Does not rely on third-party libraries or middleware.
The algorithm is simple, runs entirely in memory, and is highly efficient.

5.6 Disadvantages of the Snowflake algorithm

It depends on the server clock: duplicate IDs may be generated if the server clock is set back. This can be mitigated by recording the timestamp of the last generated ID and, before generating each new ID, checking whether the current server clock has moved backwards.
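
The clock check can be sketched in isolation. In the hedged example below, ClockGuard is a hypothetical helper and the 5 ms tolerance is an arbitrary assumption: a small rollback is absorbed by waiting until the clock catches up, while a larger one is rejected with an exception. The clock is injected as a LongSupplier so that the behaviour can be exercised without touching the system clock.

```java
import java.util.function.LongSupplier;

// Hypothetical guard that never hands out a timestamp older than the last one used.
final class ClockGuard {
    static final long MAX_BACKWARD_MS = 5; // assumed tolerance, tune per deployment

    private final LongSupplier clock;
    private long lastTimeMillis = -1L;

    ClockGuard(LongSupplier clock) {
        this.clock = clock;
    }

    // Returns a timestamp that is never smaller than the previously returned one.
    synchronized long safeTimestamp() {
        long now = clock.getAsLong();
        if (now < lastTimeMillis) {
            long offset = lastTimeMillis - now;
            if (offset > MAX_BACKWARD_MS) {
                throw new IllegalStateException("Clock moved backwards by " + offset + " ms");
            }
            // Small rollback: spin until the clock reaches the last used timestamp.
            while (now < lastTimeMillis) {
                now = clock.getAsLong();
            }
        }
        lastTimeMillis = now;
        return now;
    }

    public static void main(String[] args) {
        ClockGuard guard = new ClockGuard(System::currentTimeMillis);
        long first = guard.safeTimestamp();
        long second = guard.safeTimestamp();
        System.out.println(second >= first); // true
    }
}
```

After a tolerated rollback the guard may return the same timestamp as before, so a generator built on it would still rely on the sequence number to keep IDs within that millisecond distinct.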

6. Summary

In fact, the number of bits occupied by each part of the snowflake algorithm is not fixed. For example, your business may not need to run for 69 years, so you can reduce the number of bits used by the timestamp. If the Snowflake service needs more than 1024 nodes, the bits saved there can be given to the machine code.
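
As an illustration of such a trade-off, the sketch below compares the standard layout with a hypothetical one that moves two bits from the timestamp to the machine code. The 39/12/12 split and the class name BitLayout are only examples, not part of the original design.

```java
// Illustrative description of a Snowflake-style layout with adjustable field widths.
final class BitLayout {
    final int timestampBits;
    final int machineBits;
    final int sequenceBits;

    BitLayout(int timestampBits, int machineBits, int sequenceBits) {
        // The sign bit plus the three fields must fill exactly 64 bits.
        if (1 + timestampBits + machineBits + sequenceBits != 64) {
            throw new IllegalArgumentException("field widths must add up to 63 bits");
        }
        this.timestampBits = timestampBits;
        this.machineBits = machineBits;
        this.sequenceBits = sequenceBits;
    }

    // Whole 365-day years before the millisecond timestamp field overflows.
    long years() {
        return (1L << timestampBits) / (1000L * 60 * 60 * 24 * 365);
    }

    long maxNodes() {
        return 1L << machineBits;
    }

    long idsPerMilli() {
        return 1L << sequenceBits;
    }

    public static void main(String[] args) {
        BitLayout standard = new BitLayout(41, 10, 12);
        BitLayout moreNodes = new BitLayout(39, 12, 12);
        System.out.println(standard.years() + " years, " + standard.maxNodes() + " nodes");   // 69 years, 1024 nodes
        System.out.println(moreNodes.years() + " years, " + moreNodes.maxNodes() + " nodes"); // 17 years, 4096 nodes
    }
}
```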

Note that the 41 bits in the Snowflake algorithm do not store the current server millisecond timestamp directly; they store the current server timestamp minus an initial timestamp value. Generally, the time at which the service went online can be used as the initial timestamp value.
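
The effect of subtracting an initial timestamp can be checked with java.time. The sketch below computes the last year a 41-bit millisecond counter can represent, first counting from the Unix epoch and then from the INIT_EPOCH value used in the implementation above; the class name EpochRange is illustrative.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Illustrative calculation of when a 41-bit millisecond counter runs out.
final class EpochRange {
    static final long INIT_EPOCH = 1694263918335L; // 2023-09-09 (UTC), from the implementation above
    static final long MAX_OFFSET = (1L << 41) - 1;  // largest value the 41-bit field can hold

    // Year (UTC) of the last timestamp representable when counting from the given epoch.
    static int lastYear(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis + MAX_OFFSET)
                .atZone(ZoneOffset.UTC)
                .getYear();
    }

    public static void main(String[] args) {
        System.out.println(lastYear(0L));         // 2039: counting from 1970 wastes decades already past
        System.out.println(lastYear(INIT_EPOCH)); // 2093: the full range starts at service launch
    }
}
```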

The machine code can be adapted to your own situation: the computer room number, server number, business number, machine IP, and so on are all usable, as long as the final machine codes of the deployed Snowflake services differ from one another.