Understanding distributed unique ID generation in one article

Many large Internet companies handle huge volumes of data and split it across multiple databases and tables (sharding). After sharding, records need a single, globally unique ID for storage. The ID can be an incrementing number or a UUID-style value.

An incrementing ID has two benefits. After the database is split, hashing the ID spreads rows evenly across the shards, and in MySQL an incrementing primary key makes inserts considerably faster. However, if order numbers simply count upward, outsiders can easily guess how many orders you have, so this case may call for IDs that are not plainly sequential numbers.
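As a minimal sketch of the even-distribution point, assume the simplest possible routing rule, ID modulo shard count. The function name `shard_for` and the figure of 4 shards are made up for illustration and are not from any particular system.

```python
from collections import Counter


def shard_for(record_id: int, shard_count: int) -> int:
    """Route an ID to a shard by modulo (an assumed, simplified hashing rule)."""
    return record_id % shard_count


# Hypothetical setup: 4 shards receiving 1000 consecutive incrementing IDs.
counts = Counter(shard_for(i, 4) for i in range(1, 1001))
print(sorted(counts.items()))  # each shard receives the same number of rows
```

With consecutive IDs, every shard ends up with an equal share of the rows.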

For distributed IDs, Redis may come to mind first: its INCR command generates and returns an auto-incremented ID. This works, but INCR tops out at about 200,400 QPS (the benchmark figure from the official website), roughly 200,000. If your QPS stays below that, Redis is clearly a reasonable choice.
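The INCR idea can be sketched as follows. To keep the example runnable without a server, it uses a hypothetical in-memory stand-in with the same `incr(key)` shape. With the redis-py package and a running server, a `redis.Redis` client could be passed instead, since it also exposes `incr`. The key name `order:id` is an assumption.

```python
class InMemoryCounter:
    """Hypothetical stand-in that mimics the INCR command inside one process."""

    def __init__(self):
        self._values = {}

    def incr(self, key):
        # Like INCR: a missing key counts as 0, then 1 is added.
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]


def next_id(client, key="order:id"):
    """Return the next auto-incremented ID; `client` needs an incr(key) method."""
    return client.incr(key)


client = InMemoryCounter()
first_three = [next_id(client) for _ in range(3)]
print(first_three)
```

Because the counter lives in one place, every caller sees a strictly increasing sequence, which is also why throughput is bounded by that one place.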

So is there a better approach that delivers high availability, high QPS, and low latency together? Let's look at the Snowflake algorithm, open-sourced by Twitter.

A Snowflake ID has 64 bits in total:
  • 1. The first bit is not used.

  • 2. A 41-bit timestamp. 2^41 milliseconds is about 69 years, which is enough.

  • 3. A 10-bit worker machine ID, allowing 2^10 = 1024 nodes. Some implementations split it into 5 bits for the data center and 5 bits for the machine.

  • 4. The last 12 bits are an incrementing sequence number, giving 4096 values per millisecond.

The theoretical QPS is therefore about 4.09 million per second (4096 IDs per millisecond on each node).
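The bit layout and the arithmetic behind it can be checked with a short sketch. The function names `pack` and `unpack` are illustrative and not part of Twitter's code, and the timestamp is treated as a plain millisecond count rather than an offset from a custom epoch.

```python
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12

WORKER_SHIFT = SEQUENCE_BITS                  # worker ID sits above the sequence
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS  # timestamp sits above both


def pack(timestamp_ms, worker_id, sequence):
    """Combine the three fields into one integer; the top (64th) bit stays 0."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker ID does not fit in 10 bits")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence does not fit in 12 bits")
    return (timestamp_ms << TIMESTAMP_SHIFT) | (worker_id << WORKER_SHIFT) | sequence


def unpack(snowflake_id):
    """Split an ID back into (timestamp_ms, worker_id, sequence)."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (snowflake_id >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = snowflake_id >> TIMESTAMP_SHIFT
    return timestamp_ms, worker_id, sequence


# How long 41 bits of milliseconds last, and how many IDs one node can issue.
years = (1 << TIMESTAMP_BITS) / (1000 * 60 * 60 * 24 * 365)
ids_per_second = (1 << SEQUENCE_BITS) * 1000

print(round(years, 1), ids_per_second)
print(unpack(pack(123456789, 37, 4095)))
```

The 41-bit range works out to a little under 70 years, and 4096 sequence values per millisecond give 4,096,000 IDs per second for a single node.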

Advantages of this approach are:
  • 1. QPS is very high, so the high-performance requirement is met.

  • 2. It does not depend on third-party middleware such as Redis. Fewer dependencies mean higher availability.

  • 3. It can be customized: the 10 worker bits can be allocated however you like.
Disadvantages:

This algorithm depends on the system clock. If the clock is set back, it may generate an ID that has already been issued.
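The clock problem is easy to reproduce with a deliberately naive generator. This sketch assumes an injectable clock (a callable returning milliseconds) so that a rollback can be simulated, and it leaves out sequence overflow handling for brevity. The class name `NaiveSnowflake` is made up.

```python
class NaiveSnowflake:
    """Simplified generator with no protection against clock rollback."""

    def __init__(self, worker_id, clock):
        self.worker_id = worker_id
        self.clock = clock      # callable returning the current time in ms
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now == self.last_ms:
            # Same millisecond: bump the 12-bit sequence (overflow not handled).
            self.sequence = (self.sequence + 1) & 0xFFF
        else:
            # Any other millisecond, including an EARLIER one, restarts at 0.
            self.sequence = 0
        self.last_ms = now
        return (now << 22) | (self.worker_id << 12) | self.sequence


# Simulated clock readings: the third one jumps back to an earlier millisecond.
ticks = iter([1000, 1001, 1000])
gen = NaiveSnowflake(worker_id=1, clock=lambda: next(ticks))
ids = [gen.next_id() for _ in range(3)]
print(ids[0] == ids[2])  # the rolled-back reading reproduces the first ID
```

The third ID collides with the first because the timestamp, worker ID, and sequence are all identical.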

A UUID is a 128-bit value, usually written as 32 hexadecimal digits. It is very fast to generate, but the time-based version is derived from the machine's MAC address, and it is not a coordinated distributed scheme, so it is outside the scope of this discussion.
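For reference, the size of a UUID can be confirmed with Python's standard `uuid` module: the value is 128 bits long and prints as 32 hexadecimal digits. `uuid1()` is the time-based variant and `uuid4()` is the random one.

```python
import uuid

random_id = uuid.uuid4()      # random UUID
time_based_id = uuid.uuid1()  # built from a timestamp and a node identifier

print(len(random_id.bytes) * 8)  # size in bits
print(len(random_id.hex))        # number of hexadecimal digits
print(random_id.version, time_based_id.version)
```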

Now let's look at how some large companies implement distributed IDs using a database table. The ID is 8 bytes (64 bits), and the table records how far ID allocation has progressed. For example, if IDs start at 0 and the first 1000 have been handed out, the table stores a maximum value of 1000 together with a step, say 1000. The next fetch then reserves the following 1000 values, after which 2001 is the first value that has not been used.

Specific steps are as follows:
  • 1. The distributed ID service reads the current value and the step from the database, computes the range of IDs (the number segment) it may issue, stores that segment in its cache, and serves consumers from the cache.

  • 2. As soon as the service has taken a segment, the record in the database is updated.
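The two steps above can be sketched as follows. The database table is replaced by a hypothetical in-memory object, `FakeSegmentTable`, and IDs start at 1 here for simplicity. All class and method names are assumptions for illustration, not taken from any real implementation.

```python
import threading


class FakeSegmentTable:
    """In-memory stand-in for the database row that stores max_id and step."""

    def __init__(self, max_id=0, step=1000):
        self.max_id = max_id
        self.step = step
        self._lock = threading.Lock()

    def allocate(self):
        """Advance max_id by one step and return the reserved range, inclusive."""
        with self._lock:
            start = self.max_id + 1
            self.max_id += self.step  # the "database" is updated immediately
            return start, self.max_id


class SegmentIdService:
    """Serves IDs from a cached segment and refills it when it runs out."""

    def __init__(self, table):
        self.table = table
        self._lock = threading.Lock()
        self._next, self._end = 1, 0  # start with an empty segment

    def next_id(self):
        with self._lock:
            if self._next > self._end:  # cached segment is used up
                self._next, self._end = self.table.allocate()
            value = self._next
            self._next += 1
            return value


table = FakeSegmentTable(step=1000)
service = SegmentIdService(table)
issued = [service.next_id() for _ in range(1500)]
print(issued[0], issued[-1], table.max_id)
```

After 1500 IDs the service has touched the table only twice, and the stored maximum already covers the whole second segment even though half of it is still unused.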
The advantages of this approach:
  • 1. Good disaster tolerance. If the DB has a problem, the service can keep going for a while because the segment is already in memory.

  • 2. An 8-byte ID is enough for business needs.

  • 3. The maximum value can be set by hand, so a migrated service can carry on from its own existing maximum.
Of course, there are also disadvantages:
  • 1. When the database goes down, the whole system becomes unavailable once the cached segment runs out.

  • 2. IDs within a segment are increasing, so they are easy for others to guess.

  • 3. If many services fetch segments at the same time, or network jitter drives up database I/O, system stability suffers.

The fix for these problems is a double-buffer mechanism. The service loads one segment into memory and starts issuing from it. Once 10% of that segment has been used, it starts a new thread to fetch the next segment into a second buffer. When the first buffer runs out, the service switches to the second. When 10% of that one has been used, another thread fetches the following segment, and so on in rotation.

The benefit is that it avoids many simultaneous database accesses and the resulting I/O spike. With two cached segments, it also avoids the stall that occurs when a single cached segment is used up quickly. If the segment length is set to 600 times the QPS, the service can keep running for 10 to 20 minutes even when the database is down.
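A minimal sketch of the double-buffer idea, under stated assumptions: the database is replaced by a hypothetical `make_allocator` helper that hands out consecutive ranges, and the prefetch threshold is the 10% mentioned above. The names and structure are illustrative and are not copied from any real implementation.

```python
import threading


def make_allocator(step=1000):
    """Hypothetical stand-in for the database: returns consecutive ID ranges."""
    lock = threading.Lock()
    state = {"max_id": 0}

    def allocate():
        with lock:
            start = state["max_id"] + 1
            state["max_id"] += step
            return start, state["max_id"]

    return allocate


class DoubleBufferIdService:
    def __init__(self, allocate, prefetch_ratio=0.1):
        self._allocate = allocate
        self._ratio = prefetch_ratio
        self._lock = threading.Lock()
        self._current = allocate()       # segment being served, (start, end)
        self._cursor = self._current[0]
        self._standby = None             # segment loaded in the background
        self._loader = None              # background thread, if one is active

    def _load_standby(self):
        self._standby = self._allocate()

    def next_id(self):
        with self._lock:
            start, end = self._current
            if self._cursor > end:
                # Current segment is exhausted: switch to the standby one.
                if self._loader is not None:
                    self._loader.join()  # wait if the prefetch is still running
                    self._loader = None
                if self._standby is None:
                    self._standby = self._allocate()  # fallback: load inline
                self._current, self._standby = self._standby, None
                start, end = self._current
                self._cursor = start
            value = self._cursor
            self._cursor += 1
            used = self._cursor - start
            size = end - start + 1
            if (self._standby is None and self._loader is None
                    and used >= size * self._ratio):
                # 10% consumed: fetch the next segment on a separate thread.
                self._loader = threading.Thread(target=self._load_standby,
                                                daemon=True)
                self._loader.start()
            return value


service = DoubleBufferIdService(make_allocator(step=100))
issued = [service.next_id() for _ in range(350)]
print(issued[0], issued[-1])
```

Callers only block on the allocator if a segment runs out before its background fetch has finished, which is the situation the early 10% trigger is meant to make rare.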

One problem remains: the IDs are incrementing. How do we solve that? By using Snowflake, which brings back the clock problem. Some companies use ZooKeeper (ZK): each node, identified by its workerId, compares the current time with the time it recorded earlier to detect whether the clock has been set back. If it has, the node sleeps for a fixed period and checks whether the clock has caught up. If it has caught up, ID generation continues. If it still has not caught up after a certain threshold, the node reports an error. Because the middle 10 bits differ from node to node, IDs produced by different nodes do not form a simple incrementing sequence.
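The wait-then-fail handling of clock rollback can be sketched as below. This is a simplified illustration that compares against the last timestamp held in memory rather than one stored in ZooKeeper. The threshold `max_wait_ms`, the exception name, and the `FakeClock` helper used to simulate a rollback are all assumptions.

```python
import time


class ClockMovedBackwards(Exception):
    pass


class RollbackAwareSnowflake:
    def __init__(self, worker_id, clock=None, sleep_ms=None, max_wait_ms=5):
        self.worker_id = worker_id
        # Defaults use the wall clock; tests can inject fakes instead.
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.sleep_ms = sleep_ms or (lambda ms: time.sleep(ms / 1000))
        self.max_wait_ms = max_wait_ms
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            behind = self.last_ms - now
            if behind > self.max_wait_ms:
                raise ClockMovedBackwards(f"clock is {behind} ms behind")
            self.sleep_ms(behind)        # small rollback: wait for it to catch up
            now = self.clock()
            if now < self.last_ms:
                raise ClockMovedBackwards("clock did not catch up after waiting")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:       # 4096 IDs used: wait for the next ms
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        return (now << 22) | (self.worker_id << 12) | self.sequence


class FakeClock:
    """Test helper: a clock that only moves when told to."""

    def __init__(self, now_ms):
        self.now_ms = now_ms

    def __call__(self):
        return self.now_ms

    def sleep_ms(self, ms):
        self.now_ms += ms


clock = FakeClock(1000)
gen = RollbackAwareSnowflake(1, clock=clock, sleep_ms=clock.sleep_ms)
first = gen.next_id()
clock.now_ms = 997           # clock set back by 3 ms, within the threshold
second = gen.next_id()       # waits, then continues with a new sequence value
print(second > first)
```

A rollback larger than the threshold raises `ClockMovedBackwards` instead of waiting, which matches the error path described above.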

These ideas have already been implemented by one company. If you want to study them further, search GitHub for the open-source project Leaf, which can be used directly.

Origin blog.51cto.com/14230003/2424482