Solutions to the clock rollback problem when running Snowflake in production

The first method is to turn off clock synchronization, so that synchronization can never move the clock backwards. This is not realistic, because systems that depend heavily on time generally have to synchronize their clocks to avoid serious time errors. For example, when services are deployed on virtual machines, the time inside a virtual machine is often out of sync with the host's after the machine is hibernated and resumed. This can crash some large distributed data systems: nodes compare timestamps when they communicate, and if heartbeats appear too slow, a node is treated as dead.

The second method is to record, in memory, the timestamp of the last generated ID. When a request arrives, do not return an ID right away; first compare the current timestamp with the last one. If the current timestamp is smaller, the clock has been rolled back, and any ID generated now might duplicate an earlier one. ID generation is therefore refused during this period, and the request waits until the current time catches up with the last generation time. The problem is what happens when the rollback is large: the wait may be long, which hurts the availability of the system, so this is not a particularly good approach on its own. Suppose a business service needs to create a billing record and asks for an ID. It waits several hundred milliseconds, only to be told that there was an internal exception and no unique ID could be obtained. Repeated retries then affect the operation of that business service.
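The second method can be sketched roughly as follows. This is a minimal, single-threaded illustration, not a production implementation: the class name `WaitingSnowflake`, the 41/10/12 bit layout, the injectable `now_ms` and `sleep_ms` functions, and the `max_wait_ms` limit are all assumptions made for the example.

```python
import time


class ClockMovedBackwards(Exception):
    """The clock is behind the last issued timestamp by more than we will wait."""


class WaitingSnowflake:
    # Assumed layout: timestamp | 10-bit worker ID | 12-bit sequence.
    WORKER_BITS = 10
    SEQ_BITS = 12
    MAX_SEQ = (1 << SEQ_BITS) - 1

    def __init__(self, worker_id, now_ms=None, sleep_ms=None, max_wait_ms=500):
        self.worker_id = worker_id
        # The clock and sleep functions are injectable so the logic can be tested.
        self.now_ms = now_ms or (lambda: int(time.time() * 1000))
        self.sleep_ms = sleep_ms or (lambda ms: time.sleep(ms / 1000))
        self.max_wait_ms = max_wait_ms
        self.last_ts = -1  # timestamp of the last generated ID, kept in memory
        self.seq = 0

    def next_id(self):
        ts = self.now_ms()
        if ts < self.last_ts:
            # Clock rolled back: wait for it to catch up, or give up if too far.
            behind = self.last_ts - ts
            if behind > self.max_wait_ms:
                raise ClockMovedBackwards(f"clock is {behind} ms behind")
            self.sleep_ms(behind)
            ts = self.now_ms()
            if ts < self.last_ts:
                raise ClockMovedBackwards("clock still behind after waiting")
        if ts == self.last_ts:
            self.seq = (self.seq + 1) & self.MAX_SEQ
            if self.seq == 0:
                # Sequence exhausted in this millisecond: spin until the next one.
                while ts <= self.last_ts:
                    ts = self.now_ms()
        else:
            self.seq = 0
        self.last_ts = ts
        shift = self.WORKER_BITS + self.SEQ_BITS
        return (ts << shift) | (self.worker_id << self.SEQ_BITS) | self.seq
```

A real service would also need a lock around `next_id`, and the caller has to decide what to do with `ClockMovedBackwards`, which is exactly the availability problem described above.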

The third method is an optimization of the second. When the current timestamp is smaller than the timestamp of the last generated ID, work out how many milliseconds the clock has been rolled back. If the rollback is small, for example within 500ms, hold the request and wait up to 500ms; once the current timestamp exceeds the timestamp of the last generated ID, generate a unique ID normally and return it to the business side. For the business side this only happens in the few cases where the clock is rolled back: a request that usually takes 50ms takes 500ms instead, which is slower but still within an acceptable range. If the rollback exceeds 500ms but is within 5s, return an abnormal status together with how long the abnormal state is expected to last, so that the client is told when to retry rather than simply being told there is a problem. If the rollback is too severe, for example more than 1 minute, raise an alarm immediately, stop serving requests, and remove the instance from the cluster; if it is registered with a microservice registry, it has to take itself offline. It is best not to leave the retry mechanism to the business side. Instead, provide a client for your unique ID generation service that calls your interface over RPC and has automatic retry built in: when a server responds that it cannot provide service for a short time, the client automatically requests a unique ID from another machine.
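The client-side retry described above might look roughly like this. It is only a sketch: `IdClient`, `TemporarilyUnavailable`, and the idea of representing each server as a plain callable are assumptions for illustration; a real client would wrap actual RPC calls.

```python
class TemporarilyUnavailable(Exception):
    """A server reports it cannot issue IDs for a short time."""

    def __init__(self, retry_after_ms):
        super().__init__(f"retry after {retry_after_ms} ms")
        self.retry_after_ms = retry_after_ms


class IdClient:
    def __init__(self, servers):
        # servers: mapping of (hypothetical) server name -> callable returning an ID.
        # Each callable stands in for an RPC call to one ID server.
        self.servers = dict(servers)

    def next_id(self):
        last_error = None
        for name, call in self.servers.items():
            try:
                return call()
            except TemporarilyUnavailable as err:
                # This server is recovering from a clock rollback: try the next one.
                last_error = err
        raise RuntimeError("no ID server available") from last_error
```

The business code only ever calls `client.next_id()`; which server answered, and whether one of them was skipped, stays hidden inside the client.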
To handle clock rollback, the second and third methods are generally used together, but passive waiting, or even actively going offline, always affects the availability of the system, so this is still not particularly good. The combination is a clock rollback detection mechanism on the server side plus a client that the service provides itself, with thresholds such as:
Within 1s: the request blocks and waits, and the client's timeout should also be 1s. This can be optimized: keep the largest sequence number generated in each millisecond over the last 1s, use the millisecond of the current timestamp to look up the largest sequence number previously issued in that millisecond, and continue incrementing from it. With this optimization there is no need to block and wait.
Between 1s and 10s: return an exception code and the duration of the exception; the client does not send requests to this machine for the specified time.
More than 10s: return a fault code and ask the service registry to take this instance offline. When the client receives the fault code, it removes this machine from its list of service machines and stops sending requests to it. Later, once the ID service on that machine finds that its clock has caught up and it is available again, it registers itself again. When the client refreshes its service list, it discovers the machine and can send requests to it again.
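The three thresholds above amount to a small decision function on the server side. The sketch below assumes the 1s and 10s limits from the text are inclusive and uses hypothetical names (`Action`, `classify_rollback`); what each action actually does (blocking, returning a code, deregistering) is left to the surrounding service.

```python
from enum import Enum


class Action(Enum):
    GENERATE = "generate"          # clock is fine, issue an ID
    WAIT = "wait"                  # small rollback: block until the clock catches up
    REJECT_TEMPORARILY = "reject"  # medium rollback: return exception code + duration
    GO_OFFLINE = "offline"         # large rollback: return fault code and deregister


def classify_rollback(now_ms, last_ts_ms, wait_limit_ms=1_000, offline_limit_ms=10_000):
    """Return (action, how many ms the clock is behind the last issued ID)."""
    behind = last_ts_ms - now_ms
    if behind <= 0:
        return Action.GENERATE, 0
    if behind <= wait_limit_ms:
        return Action.WAIT, behind
    if behind <= offline_limit_ms:
        return Action.REJECT_TEMPORARILY, behind
    return Action.GO_OFFLINE, behind
```

The second element of the result doubles as the "duration of the exception" that the server reports to the client in the 1s to 10s case.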

The fourth method is to keep in memory the IDs generated over the last few seconds, specifically the largest sequence number issued in each millisecond. Clock rollback is generally tens to hundreds of milliseconds and rarely exceeds a second, so keeping the last few seconds is enough. When a rollback occurs, check which millisecond the clock has been rolled back to (the timestamp has millisecond precision), and continue from the sequence number last issued in that millisecond; every subsequent millisecond is handled the same way. This avoids duplicates and requires no waiting. A fallback mechanism is still needed: if you keep the IDs generated in each millisecond of the last 10s, what if the clock is rolled back by more than 10s? The probability is very low, but you can combine this with the second and third methods and set several thresholds. For example, with the last 10s retained, a rollback within 10s causes no duplicates and no pauses; for a rollback of more than 10s but within 60s, the request can wait until the clock advances into the retained 10s range; if the rollback exceeds 60s, the instance goes offline directly. One weakness remains: after a restart, the timestamp of the last generated ID and the per-millisecond maximum sequence numbers are gone. If a rollback occurs across the restart, it cannot be detected, and there is no way to keep generating non-repeating IDs by continuing from the earlier sequence numbers.
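A rough, single-threaded sketch of the fourth method follows. The names (`HistorySnowflake`, `window_ms`), the bit layout, and the injected `now_ms` clock are assumptions for illustration, and the waiting and offline tiers are reduced to a single exception to keep the example short.

```python
class HistorySnowflake:
    # Assumed layout: timestamp | 10-bit worker ID | 12-bit sequence.
    WORKER_BITS = 10
    SEQ_BITS = 12
    MAX_SEQ = (1 << SEQ_BITS) - 1

    def __init__(self, worker_id, now_ms, window_ms=10_000):
        self.worker_id = worker_id
        self.now_ms = now_ms        # injected clock returning milliseconds
        self.window_ms = window_ms  # how much history to retain
        self.max_ts = -1            # highest timestamp seen so far
        self.last_seq = {}          # millisecond -> highest sequence issued in it

    def next_id(self):
        ts = self.now_ms()
        if ts < self.max_ts - self.window_ms:
            # Rolled back further than the retained history: fall back to
            # waiting or going offline (not shown here).
            raise RuntimeError("rollback exceeds the retained window")
        # Continue from the largest sequence already issued in this millisecond.
        seq = self.last_seq.get(ts, -1) + 1
        if seq > self.MAX_SEQ:
            raise RuntimeError("sequence exhausted for this millisecond")
        self.last_seq[ts] = seq
        if ts > self.max_ts:
            self.max_ts = ts
            # Drop history that has fallen out of the retained window.
            cutoff = ts - self.window_ms
            for old in [t for t in self.last_seq if t < cutoff]:
                del self.last_seq[old]
        shift = self.WORKER_BITS + self.SEQ_BITS
        return (ts << shift) | (self.worker_id << self.SEQ_BITS) | seq
```

Note that IDs issued during a rollback stay unique but are no longer increasing over time, and that `last_seq` lives only in memory, so it is lost on restart as described above.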

Origin blog.csdn.net/itlijinping_zhang/article/details/122414255