Twitter Snowflake

This article introduces ways to generate globally unique IDs in a distributed system.

I. Problem description

Suppose a distributed system stores its data in multiple shards, and records are inserted into the shards concurrently. How can each record be given an ID that is unique across the whole system?

In a stand-alone system (such as a single MySQL instance), generating unique IDs is simple: MySQL's built-in auto-increment ID is enough.

In a distributed system with multiple shards (for example, a cluster of several MySQL instances that all accept inserts), the problem becomes harder, and the generated IDs should meet the following requirements:

The generated ID must be globally unique
The way IDs are generated must not restrict future data migration between shards
The ID should preferably carry time information. For example, if the first k bits of the ID are a timestamp, the data can be sorted by time simply by sorting on the first k bits of the ID
The ID should preferably be no longer than 64 bits
ID generation must be fast. In a high-throughput scenario, tens of thousands of IDs may be needed per second (Twitter's peak at the time of writing was 143,199 Tweets/s, that is, 100,000+ per second)
The service as a whole should preferably have no single point of failure

Without these requirements the problem would be relatively simple. For example:

Call UUID.randomUUID() directly to generate a unique ID (http://www.ietf.org/rfc/rfc4122.txt). However, an ID generated this way is 128 bits long, and it does not contain a timestamp.
Use a central server to generate all unique IDs. However, the central server is a potential single point of failure, and supporting a high-throughput system takes extra work (for example, having each client fetch a whole batch of IDs from the central server at a time to raise ID-generation throughput).
Flickr's approach (http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/). However, its IDs do not contain a timestamp, so they cannot be sorted by time.
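The batching idea mentioned for the central-server option can be sketched roughly as follows. This is only an illustration under assumptions: `CentralAllocator` and `BatchedClient` are hypothetical names, and a method call stands in for what would really be a network round trip.

```python
import threading


class CentralAllocator:
    """Hypothetical central server: hands out disjoint ranges of IDs."""

    def __init__(self, batch_size=1000):
        self._batch_size = batch_size
        self._next_start = 1
        self._lock = threading.Lock()

    def next_batch(self):
        # One call here stands in for one round trip to the central server.
        with self._lock:
            start = self._next_start
            self._next_start += self._batch_size
        return range(start, start + self._batch_size)


class BatchedClient:
    """Serves IDs locally; contacts the allocator only when a batch runs out."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._ids = iter(())  # start with an empty batch

    def next_id(self):
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self._allocator.next_batch())
            return next(self._ids)
```

With a batch size of 1000, a client makes one request per 1000 IDs. The central server remains a single point of failure, and IDs handed to different clients are no longer ordered by creation time.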

So how can globally unique IDs be generated while meeting all 6 requirements above?

Twitter's Snowflake is a good example. The rest of this article introduces Twitter Snowflake and its variants.

II. Twitter Snowflake

https://github.com/twitter/snowflake

Layout of the unique ID generated by Snowflake (from the high bits to the low bits):

41 bits: timestamp (millisecond precision)
10 bits: node ID (5-bit datacenter ID + 5-bit worker ID)
12 bits: sequence number

That is 63 bits in total (the most significant bit of the 64-bit value is always 0)
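The layout above maps onto plain shifts and masks. The following is a minimal sketch, not Twitter's code: the function names `pack` and `unpack` are made up for illustration, and the timestamp is assumed to be a millisecond count relative to whatever epoch the deployment has chosen.

```python
TIMESTAMP_BITS, DATACENTER_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 5, 5, 12

WORKER_SHIFT = SEQUENCE_BITS                          # 12
DATACENTER_SHIFT = WORKER_SHIFT + WORKER_BITS         # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS  # 22


def pack(timestamp_ms, datacenter_id, worker_id, sequence):
    """Combine the four fields into one integer of at most 63 bits."""
    fields = (
        ("timestamp_ms", timestamp_ms, TIMESTAMP_BITS),
        ("datacenter_id", datacenter_id, DATACENTER_BITS),
        ("worker_id", worker_id, WORKER_BITS),
        ("sequence", sequence, SEQUENCE_BITS),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (
        (timestamp_ms << TIMESTAMP_SHIFT)
        | (datacenter_id << DATACENTER_SHIFT)
        | (worker_id << WORKER_SHIFT)
        | sequence
    )


def unpack(snowflake_id):
    """Split an ID back into (timestamp_ms, datacenter_id, worker_id, sequence)."""
    return (
        snowflake_id >> TIMESTAMP_SHIFT,
        (snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        (snowflake_id >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    )
```

Because the timestamp occupies the highest bits, comparing two IDs as integers compares their timestamps first, which is what makes the IDs roughly time-sortable.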

How a unique ID is generated:

The 10-bit node ID is obtained from a Zookeeper cluster when the ID-allocation worker starts up (this ensures that no two workers share a node ID)
The 41-bit timestamp: each time a new ID is needed, the worker reads the current timestamp and then chooses the sequence number according to one of two cases:
If the current timestamp equals the timestamp of the previously generated ID (same millisecond), the new sequence number (12 bits) is the previous ID's sequence number + 1. If all sequence numbers for this millisecond are used up, the worker waits for the next millisecond before continuing (no new ID can be assigned while it waits)
If the current timestamp is greater than the timestamp of the previous ID, the sequence number starts over: Twitter's reference implementation resets it to 0, and that value is the first sequence number (12 bits) of the new millisecond
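The generation steps can be sketched as a small class. This is a simplified illustration, not Twitter's implementation: the class name `SnowflakeWorker` and the injectable `clock` parameter are assumptions made for testability, the node ID is passed in directly instead of being fetched from Zookeeper, the timestamp is plain Unix milliseconds, and the sequence restarts at 0 in each new millisecond. A clock that moves backwards is simply rejected here.

```python
import threading
import time


class SnowflakeWorker:
    NODE_BITS = 10
    SEQUENCE_BITS = 12
    SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1

    def __init__(self, node_id, clock=None):
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id must fit in 10 bits")
        self._node_id = node_id
        # clock() must return the current time in milliseconds.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: take the next sequence number.
                self._sequence = (self._sequence + 1) & self.SEQUENCE_MASK
                if self._sequence == 0:
                    # All 4096 numbers are used up: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                # New millisecond: restart the sequence.
                self._sequence = 0
            self._last_ms = now
            node_shift = self.SEQUENCE_BITS
            time_shift = self.SEQUENCE_BITS + self.NODE_BITS
            return (now << time_shift) | (self._node_id << node_shift) | self._sequence
```

A caller would create one worker per process, for example `SnowflakeWorker(node_id=3)`, and call `next_id()` for every new record. The lock keeps the last timestamp and the sequence consistent when several threads share one worker.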

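Throughout this process, a worker depends on an external service only at startup (it must obtain its worker number from Zookeeper). After that it works independently, which makes the scheme decentralized.

Failure case:

What if the timestamp just read is smaller than the timestamp of the previously generated ID? Snowflake keeps reading the machine's time and resumes work only once it obtains a larger timestamp (no new ID can be assigned while it waits)

This failure case shows that if the clocks of the machines running Snowflake drift significantly, the Snowflake system as a whole cannot work properly (the larger the drift, the longer the wait before a new ID can be assigned)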
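The waiting behaviour described for this failure case can be sketched as a polling loop. The helper below is hypothetical (the names `wait_until_after` and `now_ms` are not from Snowflake); it only illustrates why a larger clock offset means a longer stall.

```python
import time


def now_ms():
    return int(time.time() * 1000)


def wait_until_after(last_ms, clock=now_ms, sleep=time.sleep):
    """Poll the clock until it reports a timestamp greater than last_ms.

    Returns (timestamp, polls). No ID may be handed out while this loops.
    """
    polls = 0
    now = clock()
    while now <= last_ms:
        # Sleep roughly as long as the clock is behind, at least 1 ms.
        sleep(max(last_ms - now, 1) / 1000.0)
        polls += 1
        now = clock()
    return now, polls
```

If the clock was set back by 10 seconds, the generator stalls for about 10 seconds, which is why the clocks on these machines need to be kept accurate.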

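Snowflake's official documentation (https://github.com/twitter/snowflake/#system-clock-dependency) also states explicitly: "You should use NTP to keep your system clock accurate". It is also best to run NTP in a mode that never adjusts the clock backwards. In other words, when NTP corrects the time, it does not set the machine clock back.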

III. Other Snowflake variants

Snowflake has several variants: various applications have modified Snowflake to suit their own scenarios. Three of them are introduced here.

1. Boundary flake

http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang/

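Changes:

The ID is extended to 128 bits:
The highest 64 bits are the timestamp;
The next 48 bits are the worker number (the same length as a MAC address);
The last 16 bits are the sequence number
Because the worker ID is 48 bits, the same length as a MAC address, a worker can use its MAC address and does not need to contact Zookeeper at startup to obtain a worker ID. This makes the scheme fully decentralized
It is implemented in Erlang

The goal is to use more bits to lower the probability of collisions, which allows more workers to run at the same time. It also allows more IDs to be assigned per millisecond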
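The 128-bit layout can be illustrated in Python, although flake itself is written in Erlang. This is a sketch under assumptions: `flake_id` is a made-up name, and `uuid.getnode()` from the Python standard library stands in for reading the MAC address (it may fall back to a random 48-bit number when no hardware address can be found).

```python
import time
import uuid


def flake_id(timestamp_ms, worker_id, sequence):
    """Build a 128-bit ID: 64-bit timestamp | 48-bit worker | 16-bit sequence."""
    if not 0 <= timestamp_ms < (1 << 64):
        raise ValueError("timestamp_ms does not fit in 64 bits")
    if not 0 <= worker_id < (1 << 48):
        raise ValueError("worker_id does not fit in 48 bits")
    if not 0 <= sequence < (1 << 16):
        raise ValueError("sequence does not fit in 16 bits")
    return (timestamp_ms << 64) | (worker_id << 16) | sequence


def local_flake_id(sequence):
    # The worker ID comes from the local machine, not from a coordinator.
    return flake_id(int(time.time() * 1000), uuid.getnode(), sequence)
```

Printed with `format(some_id, "032x")`, such an ID reads as 16 hex digits of timestamp, 12 of worker ID and 4 of sequence number. It no longer fits in a 64-bit integer column.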

2. Simpleflake

http://engineering.custommade.com/simpleflake-distributed-id-generation-for-the-lazy/

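Simpleflake's idea is to drop the worker number, keep the 41-bit timestamp, and extend the sequence number to 22 bits.

Characteristics of Simpleflake:

The sequence number is generated purely at random (which also means that generated IDs may collide)
There is no worker number, so there is no need to contact Zookeeper, and the scheme is fully decentralized
The timestamp is kept the same as Snowflake's, so a later upgrade to Snowflake can be seamless

Simpleflake's weakness is that the purely random sequence number makes duplicate IDs possible. The probability of a duplicate grows with the number of IDs generated per second.

Simpleflake is therefore limited to workloads that do not generate too many IDs per second (preferably fewer than 100 per second; above that rate Simpleflake is no longer suitable, and switching back to Snowflake is recommended).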
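The trade-off can be made concrete with a small sketch. It is not the Simpleflake library: the function names are made up, the 22-bit random field follows the layout described in this article, and the timestamp is assumed to be a millisecond count from some chosen epoch. `collision_probability` is the standard birthday calculation for IDs that land in the same millisecond.

```python
import random
import time

RANDOM_BITS = 22  # width of the random field as described in this article

_rng = random.SystemRandom()


def simpleflake_style_id(now_ms=None):
    """Timestamp in the high bits, purely random bits in the low bits."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms << RANDOM_BITS) | _rng.getrandbits(RANDOM_BITS)


def collision_probability(ids_in_same_ms, random_bits=RANDOM_BITS):
    """Chance that at least two IDs generated in one millisecond are equal."""
    space = 1 << random_bits
    p_all_distinct = 1.0
    for i in range(ids_in_same_ms):
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct
```

Two IDs can only collide when they fall into the same millisecond. At low rates that rarely happens, and every extra ID in the same millisecond raises the probability, which is the reason for the rate limit above.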

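3. Instagram's approach

First, a brief introduction to Instagram's distributed storage scheme:

Each table is first divided into many logical shards. The number of logical shards can be large, for example 2000
A rule then determines which database instance stores each logical shard. Far fewer database instances are needed. For example, in a system with 2 PostgreSQL instances (Instagram uses PostgreSQL), the rule could be that odd-numbered logical shards go to the first instance and even-numbered logical shards go to the second
Each table designates one field as its shard key (for example, the user table can use uid as the shard key)
When a new row is inserted, the value of the shard key determines which logical shard the row is assigned to
The mapping between logical shards and PostgreSQL instances then determines which PostgreSQL instance stores the row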
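The two-level routing can be sketched as two small functions. Several details here are assumptions for illustration: taking the shard key modulo the number of logical shards is one possible assignment rule, and the instance names `pg-1` and `pg-2` are hypothetical.

```python
LOGICAL_SHARDS = 2000
DB_INSTANCES = ("pg-1", "pg-2")  # hypothetical instance names


def logical_shard(uid):
    """Shard key -> logical shard (assumed rule: uid modulo shard count)."""
    return uid % LOGICAL_SHARDS


def db_instance(shard):
    """Logical shard -> database instance, using the odd/even rule above."""
    return DB_INSTANCES[0] if shard % 2 == 1 else DB_INSTANCES[1]


def route(uid):
    shard = logical_shard(uid)
    return shard, db_instance(shard)
```

Under these assumptions `route(31341)` yields logical shard 1341 on the first instance. Adding database instances later only changes `db_instance`; the assignment of rows to logical shards stays the same.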

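Composition of an Instagram unique ID:

41 bits: timestamp (milliseconds)
13 bits: the number of the logical shard (up to 8 x 1024 = 8192 logical shards are supported)
10 bits: sequence number; each shard can generate at most 1024 IDs per millisecond

The 41-bit timestamp is generated much as in Snowflake, so it is not covered in detail here.

The following describes how the 13-bit logical shard number and the 10-bit sequence number are generated.

Logical shard number:

Suppose a new user record is being inserted. At insert time, the uid determines which logical shard the record belongs to.
Suppose the record goes into logical shard 1341 (and the table has 2000 logical shards in total)
The 13-bit field of the new ID is then filled with the number 1341

The sequence number is generated from the auto-increment sequence that PostgreSQL maintains for each table:

If the table already holds 5000 records, the next value of its auto-increment sequence is 5001 (it can be obtained directly by calling a function from PL/pgSQL)
Taking 5001 modulo 1024 then gives the 10-bit sequence number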
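Putting the three fields together, the worked example above (logical shard 1341, table sequence 5001) can be reproduced with a short sketch. The function names are made up, the arithmetic is done in Python here rather than inside the database, and the timestamp is assumed to be milliseconds since a chosen epoch.

```python
SHARD_BITS = 13
SEQUENCE_BITS = 10

SHARD_SHIFT = SEQUENCE_BITS                   # 10
TIMESTAMP_SHIFT = SHARD_BITS + SEQUENCE_BITS  # 23


def instagram_style_id(ms_since_epoch, shard_id, table_sequence):
    """41-bit timestamp | 13-bit logical shard | 10-bit sequence."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id does not fit in 13 bits")
    sequence = table_sequence % (1 << SEQUENCE_BITS)  # e.g. 5001 % 1024
    return (ms_since_epoch << TIMESTAMP_SHIFT) | (shard_id << SHARD_SHIFT) | sequence


def shard_of(unique_id):
    """Recover the logical shard number from an ID."""
    return (unique_id >> SHARD_SHIFT) & ((1 << SHARD_BITS) - 1)


def sequence_of(unique_id):
    return unique_id & ((1 << SEQUENCE_BITS) - 1)
```

For the example values, `5001 % 1024` is 905, so the low 10 bits of the ID hold 905 and the next 13 bits hold 1341. `shard_of` reads the shard number straight back out of the ID.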

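The advantages of Instagram's scheme are:

The logical shard number replaces the worker number used by Snowflake, so there is no need to obtain a worker number from a central node. The scheme is fully decentralized
A side benefit is that the ID itself reveals which logical shard stores the record

In addition, future data migration is done in units of logical shards, so this approach does not get in the way of future data migration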

http://darktea.github.io/notes/2013/12/08/Unique-ID