How do you handle primary key IDs after splitting a database into sharded tables?

Preface

When a relational database holds too much data, it is usually split into multiple databases and tables (sharding) to reduce the load of table lookups. There are several ways to do this: some setups keep one database with multiple tables, others use multiple databases with multiple tables. ShardingSphere is commonly used to shard databases and tables, define sharding keys, and so on. But once a table has been split, how should the primary key ID be handled? The shards of one business table must not share primary key IDs. If every shard simply counted up from 1, the IDs would collide across shards, so a globally unique ID is needed. This is a problem you have to face in any real production environment that uses sharding.

The following are several ways of handling primary key IDs that I have compiled:

1. Auto-increment primary key ID

This method generally declares the primary key as an auto-incrementing bigint. The problem is that the primary keys of the sub-tables must not conflict: from a business perspective, the rows of all sub-tables together make up one business table, so duplicate primary keys are not allowed.
With auto-increment IDs, you can set up a fixed number of sub-tables, give each sub-table a different starting value, and use the same step size for all of them. That guarantees the primary keys of the sub-tables never conflict.

You can scale horizontally by configuring a database sequence or the auto-increment step size of the table. For example, with 10 service nodes, each node generates IDs from its own sequence; every sequence starts at a different value and advances with a step size of 10.

In the same way, if a table has 10 sub-tables, you can set the starting primary key IDs of the sub-tables to 1 through 10 and the increment step of every sub-table to 10.

Table name    Starting primary key ID    Step size
table_1       1                          10
table_2       2                          10
table_3       3                          10
table_4       4                          10
table_5       5                          10
table_6       6                          10
table_7       7                          10
table_8       8                          10
table_9       9                          10
table_10      10                         10

Under this rule, the primary key IDs in each sub-table grow as follows: table_1 receives 1, 11, 21, ...; table_2 receives 2, 12, 22, ...; and table_10 receives 10, 20, 30, ...

The drawback of this scheme is that adding a new sub-table is hard, because the starting values and the step size are tied to the number of sub-tables. It is therefore best suited to cases where the set of sub-tables is relatively fixed.
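
The start-and-step rule can be sketched in Java as follows. This is only an illustration of the arithmetic, not how a database implements auto-increment; the names `SteppedIdDemo`, `SteppedSequence` and `forTables` are hypothetical, and in practice the start and step would be configured in the database itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

class SteppedIdDemo {

    /** Hypothetical helper: returns start, start + step, start + 2 * step, ... */
    static final class SteppedSequence {
        private final AtomicLong next;
        private final long step;

        SteppedSequence(long start, long step) {
            this.next = new AtomicLong(start);
            this.step = step;
        }

        long nextId() {
            return next.getAndAdd(step);
        }
    }

    /** One sequence per sub-table: table i starts at i, and every table uses step = tableCount. */
    static List<SteppedSequence> forTables(int tableCount) {
        List<SteppedSequence> sequences = new ArrayList<>();
        for (int i = 1; i <= tableCount; i++) {
            sequences.add(new SteppedSequence(i, tableCount));
        }
        return sequences;
    }

    public static void main(String[] args) {
        List<SteppedSequence> sequences = forTables(10);
        Set<Long> seen = new HashSet<>();
        for (int table = 0; table < sequences.size(); table++) {
            StringBuilder line = new StringBuilder("table_" + (table + 1) + ":");
            for (int n = 0; n < 3; n++) {
                long id = sequences.get(table).nextId();
                // Every ID must be new across all sub-tables.
                if (!seen.add(id)) {
                    throw new IllegalStateException("duplicate id " + id);
                }
                line.append(' ').append(id);
            }
            System.out.println(line);
        }
    }
}
```

Because every sequence keeps its own remainder modulo the step size, no two sub-tables can ever produce the same ID. The same property is what makes adding an eleventh table awkward: the step size would have to change for all existing tables.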

2. UUID as primary key

The advantage is that a UUID is generated locally, without involving the database. The disadvantage is that a UUID is long and takes up a lot of space, so it performs poorly as a primary key. More importantly, UUIDs are not ordered, which causes too many random writes to the B+ tree index (sequential IDs produce partially sequential writes). Because a new key cannot simply be appended at the end, it has to be inserted in the middle: the whole B+ tree node is read into memory, the record is inserted, and the whole node is written back to disk. When the records take up a lot of space, performance drops significantly.

Suitable scenarios: a UUID works well for randomly generated file names, serial numbers, and the like, but it should not be used as a primary key.

UUID.randomUUID().toString().replace("-", "") -> a 32-character hexadecimal string
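
As a minimal sketch of that one-liner in context, the class below wraps it in a helper and uses the result as a file name. The class name `UuidNameDemo`, the method `compactUuid` and the `upload-` prefix are made up for illustration.

```java
import java.util.UUID;

class UuidNameDemo {

    /** Returns a random (version 4) UUID as 32 lowercase hex characters, dashes removed. */
    static String compactUuid() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        String name = compactUuid();
        System.out.println(name + " (length " + name.length() + ")");
        // Hypothetical use: a file name that is very unlikely to collide.
        System.out.println("upload-" + name + ".png");
    }
}
```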

3. Get the current system time

This approach simply uses the current time as the ID. The problem is that under high concurrency, such as thousands of requests per second, duplicates will occur, which is unacceptable. On its own, this approach is basically not worth considering.
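
The duplication is easy to observe. The sketch below (the class name `TimestampCollisionDemo` is hypothetical) reads the millisecond clock many times in a tight loop and counts how many distinct values come back; the exact count depends on the machine, so no output is shown.

```java
import java.util.HashSet;
import java.util.Set;

class TimestampCollisionDemo {

    /** Calls System.currentTimeMillis() `calls` times and counts the distinct values seen. */
    static int distinctTimestamps(int calls) {
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < calls; i++) {
            seen.add(System.currentTimeMillis());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int calls = 100_000;
        int distinct = distinctTimestamps(calls);
        // Far fewer distinct values than calls: most "IDs" would be duplicates.
        System.out.println(calls + " calls produced " + distinct + " distinct millisecond values");
    }
}
```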

Suitable scenarios: this approach is generally used by concatenating the current time with other business fields to form the ID. If the result is acceptable from a business point of view, it yields a globally unique number.
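
Here is one possible way to do the concatenation, as a sketch only. The layout (17-digit timestamp, 2-digit business type, last 6 digits of a user ID) and the names `BusinessIdDemo` and `orderNo` are assumptions for illustration, not something the approach prescribes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class BusinessIdDemo {

    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    /**
     * Hypothetical layout: 17-digit timestamp + 2-digit business type + last 6 digits of the user ID.
     * Assumes bizType is in 0..99 and userId is non-negative. The result is unique only if
     * one user creates at most one record of a given type per millisecond.
     */
    static String orderNo(LocalDateTime now, int bizType, long userId) {
        return now.format(STAMP) + String.format("%02d%06d", bizType, userId % 1_000_000);
    }

    public static void main(String[] args) {
        System.out.println(orderNo(LocalDateTime.now(), 7, 123456789L));
    }
}
```

Whether this is unique enough depends entirely on the business rule stated in the comment, which is why this approach has to be judged case by case.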

4. Snowflake algorithm

(1) 1 sign bit: always 0. The highest bit is the sign bit (1 for negative, 0 for positive), and IDs are always positive.
(2) 41-bit timestamp: millisecond precision, enough for about 69 years: (1L << 41) / (1000L * 60 * 60 * 24 * 365) = 69
(3) 5-bit worker ID: decimal range 0-31; 5-bit data center ID: decimal range 0-31. Combined, the two can address up to 1024 nodes.
(4) 12-bit sequence number: counts from 0 up to 4095, so one node can generate 4096 IDs within the same millisecond. The value starts again from 0 in each new millisecond. (A single node can therefore generate roughly four million IDs per second.)
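
The capacity figures in this list follow directly from the bit widths. A small sketch of the arithmetic (the class and method names are hypothetical):

```java
class SnowflakeCapacityDemo {

    static final long SIGN_BITS = 1L;
    static final long TIMESTAMP_BITS = 41L;
    static final long DATACENTER_ID_BITS = 5L;
    static final long WORKER_ID_BITS = 5L;
    static final long SEQUENCE_BITS = 12L;

    /** Whole 365-day years covered by a 41-bit millisecond counter. */
    static long yearsCovered() {
        long millisPerYear = 1000L * 60 * 60 * 24 * 365;
        return (1L << TIMESTAMP_BITS) / millisPerYear;
    }

    /** Number of distinct (data center, worker) combinations. */
    static long maxNodes() {
        return 1L << (DATACENTER_ID_BITS + WORKER_ID_BITS);
    }

    /** Upper bound on IDs per second for one node: 4096 per millisecond. */
    static long idsPerSecondPerNode() {
        return (1L << SEQUENCE_BITS) * 1000L;
    }

    public static void main(String[] args) {
        long totalBits = SIGN_BITS + TIMESTAMP_BITS + DATACENTER_ID_BITS + WORKER_ID_BITS + SEQUENCE_BITS;
        System.out.println("total bits          = " + totalBits);             // 64
        System.out.println("years covered       = " + yearsCovered());        // 69
        System.out.println("max nodes           = " + maxNodes());            // 1024
        System.out.println("ids per second/node = " + idsPerSecondPerNode()); // 4096000
    }
}
```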


Snowflake is Twitter's open-source distributed ID generation algorithm, originally implemented in Scala. It produces a 64-bit long ID: 1 bit is unused, 41 bits hold the millisecond timestamp, 10 bits hold the worker machine ID, and 12 bits hold the sequence number.

• 1 bit: unused. Why? In two's complement, a leading 1 makes the number negative, but the IDs we generate are all positive, so the first bit is always 0.

• 41 bits: the timestamp in milliseconds. 41 bits can represent values up to 2^41 - 1, that is, 2^41 - 1 milliseconds, which is about 69 years.

• 10 bits: the worker machine ID, meaning the service can be deployed on up to 2^10 = 1024 machines. Of these 10 bits, 5 bits are the data center (computer room) ID and 5 bits are the machine ID, so there can be at most 2^5 = 32 data centers, each with 2^5 = 32 machines.

• 12 bits: distinguishes IDs generated within the same millisecond. The largest integer that 12 bits can represent is 2^12 - 1 = 4095, so the values 0 to 4095 distinguish 4096 different IDs in the same millisecond.
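
The limits in these bullets can be checked with the same bit expression that the IdWorker class uses for maxWorkerId and sequenceMask. A short sketch, with hypothetical class and method names:

```java
class BitMaskDemo {

    /** Largest value that fits in `bits` bits: all ones in the low `bits` positions. */
    static long maxValue(long bits) {
        return -1L ^ (-1L << bits);
    }

    /** Next sequence number within one millisecond; wraps from the mask value back to 0. */
    static long nextSequence(long sequence, long sequenceMask) {
        return (sequence + 1) & sequenceMask;
    }

    public static void main(String[] args) {
        long sequenceMask = maxValue(12);
        System.out.println("max 5-bit id    = " + maxValue(5));      // 31
        System.out.println("max sequence    = " + sequenceMask);     // 4095
        System.out.println("after 4094      = " + nextSequence(4094, sequenceMask)); // 4095
        System.out.println("after 4095      = " + nextSequence(4095, sequenceMask)); // 0
    }
}
```

The wrap to 0 is the signal that all 4096 sequence numbers of the current millisecond are used up.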

Example bit layout (sign | timestamp | data center ID | worker ID | sequence):
0 | 0001100 10100010 10111110 10001001 01011100 00 | 10001 | 11001 | 0000 00000000

public class IdWorker {

    private long workerId;
    private long datacenterId;
    private long sequence;

    public IdWorker(long workerId, long datacenterId, long sequence) {
        // Sanity check: the worker ID and data center ID passed in
        // must each be between 0 and 31 (5 bits).
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(
                    String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(
                    String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
        }
        System.out.printf(
                "worker starting. timestamp left shift %d, datacenter id bits %d, worker id bits %d, sequence bits %d, workerid %d",
                timestampLeftShift, datacenterIdBits, workerIdBits, sequenceBits, workerId);

        this.workerId = workerId;
        this.datacenterId = datacenterId;
        this.sequence = sequence;
    }

    private long twepoch = 1288834974657L;

    private long workerIdBits = 5L;
    private long datacenterIdBits = 5L;

    // Bit arithmetic: 5 bits can hold at most 31, so the worker ID must be in 0..31
    private long maxWorkerId = -1L ^ (-1L << workerIdBits);

    // Same idea: 5 bits can hold at most 31, so the data center ID must be in 0..31
    private long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
    private long sequenceBits = 12L;

    private long workerIdShift = sequenceBits;
    private long datacenterIdShift = sequenceBits + workerIdBits;
    private long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;
    private long sequenceMask = -1L ^ (-1L << sequenceBits);

    private long lastTimestamp = -1L;

    public long getWorkerId() {
        return workerId;
    }

    public long getDatacenterId() {
        return datacenterId;
    }

    public long getTimestamp() {
        return System.currentTimeMillis();
    }

    public synchronized long nextId() {
        // Get the current timestamp in milliseconds
        long timestamp = timeGen();

        if (timestamp < lastTimestamp) {
            System.err.printf("clock is moving backwards.  Rejecting requests until %d.", lastTimestamp);
            throw new RuntimeException(String.format(
                    "Clock moved backwards.  Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }

        if (lastTimestamp == timestamp) {
            // At most 4096 sequence numbers are available per millisecond.
            // The mask keeps the sequence within 0..4095 no matter what initial value was passed in.
            sequence = (sequence + 1) & sequenceMask;
            if (sequence == 0) {
                timestamp = tilNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0;
        }

        // Record the timestamp (in milliseconds) of the most recently generated ID
        lastTimestamp = timestamp;

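        // Shift the timestamp left into its 41-bit field,
        // shift the data center ID into its 5-bit field,
        // shift the worker ID into its 5-bit field, and put the sequence in the last 12 bits.
        // OR them together into one 64-bit value, returned as a long.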
        return ((timestamp - twepoch) << timestampLeftShift) | (datacenterId << datacenterIdShift)
                | (workerId << workerIdShift) | sequence;
    }

    private long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    private long timeGen() {
        return System.currentTimeMillis();
    }

    // --------------- Test ---------------
    public static void main(String[] args) {
        IdWorker worker = new IdWorker(1, 1, 1);
        for (int i = 0; i < 30; i++) {
            System.out.println(worker.nextId());
        }
    }

}
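
Going the other way, an ID can be split back into its fields with shifts and masks. The sketch below assumes the same layout as IdWorker (12 sequence bits, 5 worker bits, 5 data center bits) and decodes a sample bit pattern; the class `SnowflakeDecodeDemo` and its method names are hypothetical.

```java
class SnowflakeDecodeDemo {

    static final long SEQUENCE_BITS = 12L;
    static final long WORKER_ID_BITS = 5L;
    static final long DATACENTER_ID_BITS = 5L;
    static final long TWEPOCH = 1288834974657L; // same custom epoch as IdWorker

    /** Strips separators from a "0 | 0001100 ... | 0000 00000000" layout and parses the 64 bits. */
    static long parseLayout(String layout) {
        return Long.parseLong(layout.replaceAll("[^01]", ""), 2);
    }

    static long sequenceOf(long id) {
        return id & ((1L << SEQUENCE_BITS) - 1);
    }

    static long workerIdOf(long id) {
        return (id >> SEQUENCE_BITS) & ((1L << WORKER_ID_BITS) - 1);
    }

    static long datacenterIdOf(long id) {
        return (id >> (SEQUENCE_BITS + WORKER_ID_BITS)) & ((1L << DATACENTER_ID_BITS) - 1);
    }

    /** Milliseconds elapsed since TWEPOCH when the ID was generated. */
    static long timestampOffsetOf(long id) {
        return id >>> (SEQUENCE_BITS + WORKER_ID_BITS + DATACENTER_ID_BITS);
    }

    public static void main(String[] args) {
        long id = parseLayout(
                "0 | 0001100 10100010 10111110 10001001 01011100 00 | 10001 | 11001 | 0000 00000000");
        System.out.println("id             = " + id);
        System.out.println("data center id = " + datacenterIdOf(id)); // 17
        System.out.println("worker id      = " + workerIdOf(id));     // 25
        System.out.println("sequence       = " + sequenceOf(id));     // 0
        System.out.println("epoch millis   = " + (timestampOffsetOf(id) + TWEPOCH));
    }
}
```

Being able to decode an ID is handy when debugging: the timestamp tells you when a row was created and the two 5-bit fields tell you which node created it.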

In other words: the 41 bits hold the current timestamp in milliseconds; 5 bits hold the data center ID you passed in (at most 31); another 5 bits hold the machine ID you passed in (at most 31); and the remaining 12 bits hold the sequence number. If an ID is requested within the same millisecond as the previous one, the sequence is incremented, giving up to 4096 sequence numbers per millisecond.

So you can wrap this utility class in a service of your own and initialize one worker for each machine in each data center, with that machine's sequence number starting at 0. Whenever a request arrives asking for an ID for a given machine in a given data center, the service looks up the corresponding worker and lets it generate the ID.
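
A rough sketch of such a service is shown below. To keep it self-contained it uses a cut-down stand-in (`MiniWorker`) instead of the IdWorker class; `IdServiceDemo`, `MiniWorker` and the lazy lookup map are all assumptions made for illustration, not a prescribed design.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdServiceDemo {

    /** Minimal stand-in for IdWorker: same bit layout, sequence starts at 0. */
    static final class MiniWorker {
        private static final long TWEPOCH = 1288834974657L;
        private final long datacenterId;
        private final long workerId;
        private long sequence = 0L;
        private long lastTimestamp = -1L;

        MiniWorker(long datacenterId, long workerId) {
            this.datacenterId = datacenterId;
            this.workerId = workerId;
        }

        synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now < lastTimestamp) {
                throw new IllegalStateException("clock moved backwards");
            }
            if (now == lastTimestamp) {
                sequence = (sequence + 1) & 4095L;
                if (sequence == 0) {
                    // Sequence exhausted: spin until the next millisecond.
                    while (now <= lastTimestamp) {
                        now = System.currentTimeMillis();
                    }
                }
            } else {
                sequence = 0L;
            }
            lastTimestamp = now;
            return ((now - TWEPOCH) << 22) | (datacenterId << 17) | (workerId << 12) | sequence;
        }
    }

    /** One worker per (data center, machine) pair, created lazily on first request. */
    private static final Map<Long, MiniWorker> WORKERS = new ConcurrentHashMap<>();

    static long nextId(long datacenterId, long workerId) {
        // Validate first so that an out-of-range ID can never map onto another pair's key.
        if (datacenterId < 0 || datacenterId > 31 || workerId < 0 || workerId > 31) {
            throw new IllegalArgumentException("datacenterId and workerId must be in 0..31");
        }
        long key = (datacenterId << 5) | workerId;
        return WORKERS.computeIfAbsent(key, k -> new MiniWorker(datacenterId, workerId)).nextId();
    }

    public static void main(String[] args) {
        System.out.println(nextId(1, 1));
        System.out.println(nextId(1, 1));
        System.out.println(nextId(2, 3));
    }
}
```

Keeping exactly one worker per pair matters: two workers with the same data center ID and machine ID could hand out the same ID in the same millisecond.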

You can build your own company's ID service on the snowflake algorithm. The 5 bits + 5 bits reserved for the data center ID and machine ID can also be repurposed for something else with business meaning.

The snowflake algorithm is fairly reliable, so if you really need distributed ID generation under high concurrency, it should perform well. In general, it is sufficient for scenarios with tens of thousands of concurrent requests per second.

Summary

Besides the ID generation approaches above, there are of course other primary key ID generation algorithms not covered here. Which one to use depends on the business situation.

Origin blog.csdn.net/m0_61243965/article/details/134409932