Distributed ID generation: the snowflake algorithm (Java)

Recently, my company has been migrating its database from Oracle to MySQL. The Oracle primary keys were generated with SYS_GUID(), an Oracle function that produces a globally unique 16-byte identifier (raw value). However, because MySQL's default InnoDB storage engine uses a clustered index, UUID primary keys hurt write performance. Furthermore, with future sharding of databases and tables in mind, database auto-increment is not advisable either, so we needed an ID generation algorithm that supports distributed deployment, produces globally unique values, and increases over time. After some research, the snowflake algorithm, open-sourced by Twitter, turned out to be the classic algorithm for generating distributed globally unique IDs, and it guarantees that ids trend upward over time.

After reviewing most of the material about the snowflake algorithm available online, I found many interpretations, most of them copied from one another, and most of these tutorials share two problems:

  1. They only explain the principle of the official algorithm and do not solve the configuration problem of the machine ID (5 bits) and the data center ID (5 bits), i.e. how to guarantee that each node in a distributed deployment gets a unique configuration.
  2. Every demo requires instantiating an object; none provides an out-of-the-box utility class that can be used directly in a project.

This article aims to address these two problems, and I hope it helps readers who, like me, are preparing to use the SnowFlake algorithm to generate database primary keys in their projects.

Overview

The id generated by the SnowFlake algorithm is a 64-bit integer, structured as follows:

(Diagram: snowflake id bit layout)

  • 1 bit, unused. In binary, a highest bit of 1 indicates a negative number, but the ids we generate are normally positive integers, so the highest bit is fixed at 0.

  • 41 bits, used to record the timestamp in milliseconds.

    • 41 bits can represent at most 2^41 distinct values.
    • Used only to represent non-negative integers (in computing, positive numbers include 0), the representable range is 0 to 2^41 - 1; the "minus 1" is because the range is counted from 0, not from 1.
    • In other words, the 41-bit timestamp can cover 2^41 - 1 milliseconds beyond the epoch, which is (2^41 - 1) / (1000 * 60 * 60 * 24 * 365) ≈ 69 years.
  • 10 bits, used to record the worker machine id.

    • Allows up to 2^10 = 1024 nodes, split into a 5-bit datacenterId and a 5-bit workerId.
    • The largest positive integer 5 bits can represent is 2^5 - 1 = 31, so the 32 numbers 0, 1, 2, 3, ... 31 can be used as distinct datacenterId or workerId values.
  • 12 bits, the sequence number, used to distinguish different ids generated within the same millisecond.

    • The largest positive integer 12 bits can represent is 2^12 - 1 = 4095, so the numbers 0, 1, 2, 3, ... 4095 can be used to label the up to 4096 ids generated by the same machine within the same millisecond.

Since 64-bit integers in Java are of type long, ids generated by the SnowFlake algorithm in Java are stored as long values.
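To make these ranges concrete, here is a minimal sketch (plain Java, no dependencies; the class name is mine) that computes the maximum value of each segment with the same shift trick used in the implementation below, and the roughly 69-year span of the 41-bit timestamp:

public class SnowflakeLayoutDemo {
    public static void main(String[] args) {
        // -1L ^ (-1L << n) yields the largest value representable in n bits
        long maxSequence     = -1L ^ (-1L << 12); // 4095: last sequence number within one millisecond
        long maxWorkerId     = -1L ^ (-1L << 5);  // 31: workerId range is 0~31
        long maxDatacenterId = -1L ^ (-1L << 5);  // 31: datacenterId range is 0~31
        long maxTimestamp    = -1L ^ (-1L << 41); // 2^41 - 1 milliseconds

        // (2^41 - 1) milliseconds expressed in years, roughly 69
        double years = maxTimestamp / (1000.0 * 60 * 60 * 24 * 365);
        System.out.println("max sequence     = " + maxSequence);
        System.out.println("max workerId     = " + maxWorkerId);
        System.out.println("max datacenterId = " + maxDatacenterId);
        System.out.println("timestamp span   = " + years + " years");
    }
}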

SnowFlake can guarantee:

  • All generated ids trend upward over time
  • No duplicate ids appear anywhere in the distributed system (datacenterId and workerId distinguish the nodes)

Talk is cheap, show me the code

In response to the two problems raised at the beginning of the article, my solution is to derive the workerId from the server IP and the dataCenterId from the hostName, which prevents duplication of the 10-bit machine code as far as possible. Since neither id may exceed 31, only the remainder modulo 32 can be used, so duplicates are still theoretically possible. In practice, however, hostNames and IPs are usually assigned consecutively or close together; as long as two machines are not exactly a multiple of 32 apart there is no problem, and it is almost impossible for both the hostName sums and the IP sums of two machines to collide at the same time. In general, a distributed deployment rarely exceeds 100 containers.

With this approach the snowflake algorithm can be used with zero configuration. The 10-bit machine code theoretically allows 1024 nodes. In production, a Docker image is generally built once and then distributed to different containers, so per-container configuration differences are not practical. There are also solutions that avoid duplication completely: use a Redis auto-increment counter to obtain an assigned machine code when the application starts, or let ZooKeeper issue the machine code by storing each machine's code under a persistent node and reading it at every startup. Both of these, however, require extra configuration and development. When the number of deployed machines and the level of concurrency are modest, the approach in this article is sufficient; if problems appear later, the middleware-based machine code assignment can be adopted instead, and I will update this article then.
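For reference, here is a minimal sketch of the Redis variant mentioned above. It assumes the Jedis client as a dependency, and the key name snowflake:machineId is only an example; each instance takes the next value of an auto-incremented counter at startup, so up to 1024 instances receive distinct 10-bit machine codes:

import redis.clients.jedis.Jedis;

public class MachineIdAssigner {

    /**
     * Returns {dataCenterId, workerId} assigned from a Redis counter at startup.
     * The counter is taken modulo 1024 and split into the two 5-bit fields,
     * so codes only repeat once more than 1024 instances have been started.
     */
    public static long[] assignMachineId(String redisHost, int redisPort) {
        try (Jedis jedis = new Jedis(redisHost, redisPort)) {
            long machineCode = (jedis.incr("snowflake:machineId") - 1) % 1024; // 0~1023
            long dataCenterId = (machineCode >> 5) & 31; // upper 5 bits
            long workerId = machineCode & 31;            // lower 5 bits
            return new long[]{dataCenterId, workerId};
        }
    }
}

The two values could then be passed to the SnowflakeIdWorker constructor below in place of the IP/hostName based defaults.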

The complete code is as follows:

import org.apache.commons.lang3.RandomUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.net.Inet4Address;
import java.net.UnknownHostException;

public class SnowflakeIdWorker {

    private static final Logger LOGGER = LoggerFactory.getLogger(SnowflakeIdWorker.class);

    /** Worker machine id (0~31) */
    private long workerId;
    /** Data center id (0~31) */
    private long dataCenterId;
    /** Sequence within the current millisecond (0~4095) */
    private long sequence = 0L;

    public SnowflakeIdWorker(long workerId, long dataCenterId){
        // sanity check for workerId
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0",maxWorkerId));
        }
        if (dataCenterId > maxDatacenterId || dataCenterId < 0) {
            throw new IllegalArgumentException(String.format("dataCenter Id can't be greater than %d or less than 0",maxDatacenterId));
        }
        LOGGER.info("worker starting. timestamp left shift = {}, dataCenter id bits = {}, worker id bits = {}, sequence bits = {}, workerid = {}, dataCenterId = {}",
                timestampLeftShift, datacenterIdBits, workerIdBits, sequenceBits, workerId,dataCenterId);

        this.workerId = workerId;
        this.dataCenterId = dataCenterId;
    }

    /** Custom start epoch (2020-01-01 00:00:00 GMT+8) */
    private long twepoch = 1577808000000L;

    /** Both machine-id fields are 5 bits long */
    private long workerIdBits = 5L;
    private long datacenterIdBits = 5L;

    /** Maximum supported machine id, which is 31 (this shift trick quickly computes the largest decimal value an n-bit binary number can represent) */
    private long maxWorkerId = -1L ^ (-1L << workerIdBits);
    /** Maximum supported data center id, which is 31 */
    private long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);

    /** Number of bits the sequence occupies in the id */
    private long sequenceBits = 12L;
    /** Mask for the sequence, 4095 here (0b111111111111 = 0xfff = 4095) */
    private long sequenceMask = -1L ^ (-1L << sequenceBits);

    // Number of bits the worker id is shifted left: 12
    private long workerIdShift = sequenceBits;
    // Number of bits the data center id is shifted left: 12 + 5 = 17
    private long datacenterIdShift = sequenceBits + workerIdBits;
    // Number of bits the timestamp is shifted left: 12 + 5 + 5 = 22
    private long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;

    // Timestamp of the last id generation, initialized to a negative value
    private long lastTimestamp = -1L;

    private static SnowflakeIdWorker idWorker;

    static {
        idWorker = new SnowflakeIdWorker(getWorkId(), getDataCenterId());
    }

    /**
     * Returns the next id (this method is thread-safe).
     * @return SnowflakeId
     */
    public synchronized long nextId() {
        long timestamp = timeGen();

        // If the current timestamp is smaller than the last one, the clock has moved backwards
        if (timestamp < lastTimestamp) {
            LOGGER.error("clock is moving backwards.  Rejecting requests until : {}.", lastTimestamp);
            throw new RuntimeException(String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds",
                    lastTimestamp - timestamp));
        }

        // Same millisecond as the last id: increment the sequence; otherwise reset it to 0
        if (lastTimestamp == timestamp) {
            sequence = (sequence + 1) & sequenceMask;
            // Sequence overflow within this millisecond: block until the next millisecond
            if (sequence == 0) {
                timestamp = tilNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0;
        }

        // Remember the timestamp of this id
        lastTimestamp = timestamp;

        /*
         * Assembling the result:
         * (timestamp - twepoch) << timestampLeftShift : the timestamp minus the epoch, shifted left into its position
         * (dataCenterId << datacenterIdShift) : the data center id shifted left into its position
         * (workerId << workerIdShift) : the worker id shifted left into its position
         * | is the bitwise OR: each part only carries meaningful bits in its own positions and zeros elsewhere,
         * so OR-ing the parts together yields the final assembled id.
         */
        return ((timestamp - twepoch) << timestampLeftShift) |
                (dataCenterId << datacenterIdShift) |
                (workerId << workerIdShift) |
                sequence;
    }

    /**
     * Blocks until the next millisecond, i.e. until a new timestamp is obtained.
     * @param lastTimestamp the timestamp of the last generated id
     * @return the current timestamp
     */
    private long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    /**
     * Returns the current time in milliseconds.
     * @return current time (milliseconds)
     */
    private long timeGen(){
        return System.currentTimeMillis();
    }

    /** Derives the workerId from the sum of the characters of the local IP address, modulo 32 */
    private static Long getWorkId(){
        try {
            String hostAddress = Inet4Address.getLocalHost().getHostAddress();
            char[] chars = hostAddress.toCharArray();
            int sums = 0;
            for(int b : chars){
                sums += b;
            }
            return (long)(sums % 32);
        } catch (UnknownHostException e) {
            // Fall back to a random number if the IP address cannot be obtained
            return RandomUtils.nextLong(0,31);
        }
    }

    /** Derives the dataCenterId from the sum of the characters of the hostName, modulo 32 */
    private static Long getDataCenterId(){
        try {
            char[] chars = Inet4Address.getLocalHost().getHostName().toCharArray();
            int sums = 0;
            for (int i: chars) {
                sums += i;
            }
            return (long)(sums % 32);
        } catch (UnknownHostException e) {
            // Fall back to a random number if the hostName cannot be obtained
            return RandomUtils.nextLong(0,31);
        }
    }

    /**
     * Static entry point: generates the next id without any instantiation.
     *
     * @return a new snowflake id
     */
    public static Long generateId(){
        return idWorker.nextId();
    }

    /** Test */
    public static void main(String[] args) {
        System.out.println(System.currentTimeMillis());
        long startTime = System.nanoTime();
        for (int i = 0; i < 50000; i++) {
            long id = SnowflakeIdWorker.generateId();
            LOGGER.info("id = {}", id);
        }
        LOGGER.info((System.nanoTime() - startTime) / 1000000 + "ms");
    }

}
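With the class above on the classpath, nothing needs to be instantiated or configured; a minimal usage sketch (the demo class name is mine) looks like this:

public class SnowflakeUsageDemo {
    public static void main(String[] args) {
        // The static block inside SnowflakeIdWorker has already derived
        // workerId and dataCenterId from the local IP and hostName.
        Long orderId = SnowflakeIdWorker.generateId();
        Long userId = SnowflakeIdWorker.generateId();
        System.out.println("orderId = " + orderId + ", userId = " + userId);
    }
}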

Extensions

After understanding this algorithm, there are some extensions you can build on top of it:

  1. In theory the 41-bit timestamp can cover 69 years. If the timestamp were counted from 1970, the milliseconds between 1970 and 2019 would already be spent, so instead a custom reference point is set (usually the time the id generator is first put into use) and the difference is stored (current timestamp minus the start timestamp). This extends the usable range of the timestamp (the twepoch field in the code above does exactly this).
  2. Decode the id: since each segment of the id stores specific information, given an id you can work backwards to recover the original content of each segment. This recovered information can help with analysis; for example, from an order id you can tell the date the order was created, the data center that handled it, and so on (see the sketch after this list).
  3. Improve the machine-id generation strategy, for example switch to ZooKeeper-issued codes or a Redis auto-increment counter, to achieve completely collision-free id generation.
  4. Adjust the information stored in each bit segment to fit your own business. The algorithm is generic; you can change the size of each segment and the information it stores according to your needs.
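As an illustration of point 2, here is a minimal decoding sketch (the class name is mine; the shifts and masks match the bit layout of the code above):

import java.util.Date;

public class SnowflakeIdDecoder {

    /** Must match the twepoch field of SnowflakeIdWorker above. */
    private static final long TWEPOCH = 1577808000000L;

    public static void decode(long id) {
        long sequence     = id & 0xFFF;           // lowest 12 bits
        long workerId     = (id >> 12) & 0x1F;    // next 5 bits
        long dataCenterId = (id >> 17) & 0x1F;    // next 5 bits
        long timestamp    = (id >> 22) + TWEPOCH; // remaining bits plus the epoch

        System.out.println("generated at : " + new Date(timestamp));
        System.out.println("dataCenterId : " + dataCenterId);
        System.out.println("workerId     : " + workerId);
        System.out.println("sequence     : " + sequence);
    }

    public static void main(String[] args) {
        decode(SnowflakeIdWorker.generateId());
    }
}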

Reference materials:

Because the algorithm makes heavy use of bit operations, you can refer to this analysis if you are not familiar with them:

https://segmentfault.com/a/1190000011282426

Original article: blog.csdn.net/taurus_7c/article/details/104021652