[Translation] Linus politely criticized a developer over spinlocks

I recently heard that Linus had patiently but politely criticized a developer. The original post is here: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723

Today I finally felt like translating the text. It is not hard to understand, but I have added quite a few annotations in the middle; they reflect my own understanding, so I cannot guarantee they are all correct, though they should be roughly right. The translation comes first, followed by the original text.

Translation:
The whole post is wrong: what it measures is completely different from what the author believes and claims it is measuring.

First, you have to know that spinlocks should only be used when you know you will not be scheduled away while holding them. (Translator's note: the point is that a thread holding a spinlock should keep its CPU the whole time; "scheduled" here means "scheduled off the CPU".) But the blog author appears to have implemented his own spinlocks in user space, without caring whether the lock user (Translator's note: i.e. the thread holding the lock) might get scheduled away or not. And the code behind his claimed "time with the lock not held" numbers is complete garbage.

It reads the time just before releasing the lock, then reads it again after reacquiring the lock, and claims that the difference between the two is the time during which nobody held the lock. That is simply inane, pointless, and completely wrong.
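(Translator's note: to make the failure mode concrete, here is a minimal C++ sketch of the kind of user-space spinlock benchmark being criticized. The names and structure are my own illustration and assumptions, not the blog author's actual code.)

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

// A naive user-space spinlock: nothing prevents the scheduler from
// preempting the thread that currently holds it.
static std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

// Timestamp written by the unlocking thread just before it releases the lock.
static std::atomic<long long> release_time_ns{0};

static long long now_ns()
{
    using namespace std::chrono;
    return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

static void spin_lock()
{
    while (lock_flag.test_and_set(std::memory_order_acquire)) { /* burn CPU */ }
}

static void spin_unlock()
{
    lock_flag.clear(std::memory_order_release);
}

// The flawed pattern: the clock is read *before* the lock is released. If the
// scheduler preempts the thread between the timestamp and the unlock, the lock
// stays held for a whole timeslice (or more) while the timestamp goes stale.
static void worker()
{
    spin_lock();
    // ... the "work" being benchmarked ...
    release_time_ns.store(now_ns(), std::memory_order_relaxed);
    spin_unlock();                       // may happen much later than the timestamp

    spin_lock();                         // reacquire and "measure"
    long long gap = now_ns() - release_time_ns.load(std::memory_order_relaxed);
    // Claimed to be "time during which nobody held the lock"; in reality it
    // mostly measures how long the scheduler left competing threads running.
    std::printf("claimed unlocked gap: %lld ns\n", gap);
    spin_unlock();
}
```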

This is pure garbage. Here is what actually happens:

(a) Since you are spinning, you are using CPU time.
(b) At some random moment, the scheduler takes the CPU away from you (Translator's note: schedules you out).
(c) That random moment may fall just after you read the "current time", but before you actually release the spinlock.

So now you still hold the lock, but you have used up your time slice, so you no longer have the CPU. The "current time" you recorded is now stale; it has nothing to do with the (later) moment at which you actually release the lock.

Now somebody else comes along wanting that "spinlock", and that somebody will spin for a long while, because the lock has not been released: it is still held by the previous thread, which simply no longer has a CPU. At some point the scheduler runs the lock-holding thread again, and only then is the lock actually released. Then the waiting thread acquires the lock, reads the current time, compares it with the earlier timestamp, and says: "oh, a long time passed without anyone holding the lock." (Translator's note: in fact the lock was held the whole time; its holder simply was not running on a CPU after it recorded the timestamp.)

And note that the above is still the good scenario. If you have more threads than CPUs (perhaps because of other processes unrelated to your test), the next thread scheduled may not even be the one that will release the lock. It may be yet another thread that wants the lock, while the thread that actually holds it still is not running!

So the code in question is pure garbage. You cannot do spinlocks like that. Or rather, you very much can, but then you are measuring random latencies and getting meaningless values, because what you are really measuring is "I have a lot of busywork, all the processes are CPU-bound, and I am measuring random points of how long the scheduler kept a process in place".

And then you write a blog post blaming other people, without understanding that it is your own incorrect code that is garbage and is producing random garbage values.

Then you test different schedulers, get different random values, and find that interesting, because you think they show something cool about the schedulers.

But no. You are just getting random values, because different schedulers have different heuristics for "should I give CPU-bound processes longer time slices or not", particularly under a load where every thread is spinning on the silly, buggy benchmark, so they all look like pure throughput benchmarks and do not appear to be waiting on each other at all.

You might even see things like "when I run this as a foreground UI process, I get different numbers than when I run it in the background as a batch process". Cool, interesting numbers, aren't they?

No, they are not cool or interesting at all; you have just built a particularly bad random number generator.

So, what is the fix?

Use a lock where you tell the system that you are waiting for it, and where the thread releasing the lock tells you when it is done, so that the scheduler can actually work with you instead of (randomly) working against you.

Notice how, when the author uses an actual std::mutex, things just work fairly well, regardless of the scheduler. Because now you are doing what you are supposed to do. Yes, the timing values may still be off - bad luck is bad luck - but at least now the scheduler is aware that you are waiting on a lock.
(Translator's note: the point of this paragraph is that if you use the system's locks, the scheduler cooperates with the thread that holds the lock instead of scheduling blindly, so CPU time is not wasted; the timing values may still be imprecise, because reading the clock and releasing the lock cannot be done atomically.)
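(Translator's note: for contrast, a minimal sketch of the fix Linus describes, assuming C++ and std::mutex as in the original post. The waiter blocks in the kernel and the unlocker wakes it, so the scheduler knows who is waiting for whom.)

```cpp
#include <mutex>

std::mutex m;   // a sleeping lock: waiters block in the kernel instead of spinning

void critical_section()
{
    std::lock_guard<std::mutex> guard(m);   // sleeps if the lock is already held
    // ... protected work ...
}   // releasing the mutex wakes a waiter, so the scheduler can run it next
```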

Or, if you really want to use spinlocks (hint: you don't), make sure that you keep the CPU the whole time you hold the lock. For that you need a realtime scheduler (or you need to be the kernel: inside the kernel, spinlocks are fine, because the kernel itself can say, "hey, I'm holding a spinlock, you can't schedule me away right now" (Translator's note: i.e. I keep the CPU and cannot be scheduled off it)).
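(Translator's note: on Linux, "use a realtime scheduler" would mean something like the sketch below, which switches the calling thread to SCHED_FIFO via sched_setscheduler. This is only an illustration of the idea, not a recommendation; it needs appropriate privileges, and it carries exactly the risks described in the next paragraph.)

```cpp
#include <sched.h>
#include <cstdio>

// Illustrative only: give the calling thread a real-time FIFO policy so it is
// not timesliced away while it holds a user-space spinlock. Requires root or
// CAP_SYS_NICE, and comes with the dangers Linus describes below.
bool make_realtime(int priority /* 1..99 */)
{
    sched_param param{};
    param.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        std::perror("sched_setscheduler");
        return false;
    }
    return true;
}
```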

But if you use a realtime scheduler, you need to be aware of its other implications. There are many, and some of them are deadly. I strongly suggest not even trying. You will probably get plenty of other things wrong anyway, and now some of those mistakes (such as unfairness or priority inversion) can hang your whole program entirely, and things go from "slow because my locking is bad" to "not working at all, because I didn't think through a lot of other things".

Note that even OS kernels can run into this problem - imagine what happens in a virtualized environment where the hypervisor has overcommitted the physical CPUs backing the virtual CPUs. Yes, exactly. Don't do that. Or at least be aware of it, and use virtualization-aware, paravirtualized spinlocks, so that you can tell the hypervisor "hey, don't schedule me away right now, I'm in a critical region".
(Translator's note: this paragraph says that in a virtualized environment, when physical CPUs are overcommitted as virtual CPUs, a virtual CPU spinning on a lock hits the same problem: because there are more runnable threads than physical CPUs (i.e. overcommit), the thread holding the spinlock may be suspended without a physical CPU to run on.)

Because otherwise, at some point you will be scheduled away while holding the lock - perhaps right after you have finished all the work and are just about to release it - and every thread that wants the lock will block on your broken locking for the whole time you are scheduled out, all of them spinning on their CPUs and making no progress.
(Translator's note: the point is that a thread holding a spinlock must not be scheduled off the CPU, and apart from using a realtime scheduler, the only way to guarantee that is to use spinlocks inside the kernel. With user-space spinlocks there is no guarantee the scheduler will not schedule the holder away, which wastes CPU time with no progress: all the threads that want the lock spin, while the thread that holds it sits off the CPU.)

Really, it's that simple.

This has absolutely nothing to do with cache-coherency latencies. It has everything to do with badly implemented locking.

I repeat: do not use spinlocks in user space, unless you actually know what you are doing. And be aware that the likelihood that you know what you are doing is basically nil.

There is a very real reason why you need to use sleeping locks (like pthread_mutex, etc.).

In fact, I'd go even further: never write your own locking routines. Whether they are spinlocks or not, you will get it wrong. You will get the memory ordering wrong, or you will get the fairness wrong, or you will run into problems like the one above: "busy-looping while the thread that holds the lock has been scheduled out".
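(Translator's note: as one concrete example of the "you will get the memory ordering wrong" point, here is a hand-rolled test-and-set lock with relaxed ordering - a sketch of a common mistake, not code from the post. It compiles and appears to work in simple tests, but it does not order the accesses inside the critical section, so threads can see each other's half-written data.)

```cpp
#include <atomic>

std::atomic<bool> locked{false};

// Broken: relaxed ordering gives no acquire/release semantics, so reads and
// writes made inside the "critical section" are not ordered with respect to
// the lock and can effectively leak out of it.
void broken_lock()
{
    while (locked.exchange(true, std::memory_order_relaxed)) { /* spin */ }
}

void broken_unlock()
{
    locked.store(false, std::memory_order_relaxed);
}
```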

And no, sprinkling random "sched_yield()" calls into your spin loop does not really help. It easily leads to scheduling storms, with everyone yielding to all the wrong processes.
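(Translator's note: the "yield while spinning" variant looks roughly like this - again an illustration, not the blog author's code. sched_yield() hands the CPU to whatever runnable thread the scheduler picks, which is often not the lock holder, so under load it just churns the scheduler.)

```cpp
#include <atomic>
#include <sched.h>

static std::atomic_flag yield_lock = ATOMIC_FLAG_INIT;

void spin_lock_with_yield()
{
    while (yield_lock.test_and_set(std::memory_order_acquire)) {
        sched_yield();  // yields to *some* thread, not necessarily the lock holder
    }
}

void spin_unlock_with_yield()
{
    yield_lock.clear(std::memory_order_release);
}
```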

Sadly, even the system locking isn't necessarily wonderful. For a lot of benchmarks, for example, you want unfair locking, because it can improve throughput enormously. But that can cause bad latencies. And the standard system locking (e.g. pthread_mutex_lock()) may not have a flag that says "I care about fair locking because latency matters more to me than throughput".
(Translator's note: the point is that system locks are not perfect either; in some cases they will not behave the way the user wants, because they can only pick one compromise and cannot change their behavior as needs change - there is, for example, no flag to make them care more about latency or more about throughput.)

So even if you get the locking technically right and avoid the outright bugs, you may still get the wrong kind of locking behavior for your load. Throughput and latency really do have very antagonistic tendencies where locking is concerned. An unfair lock that keeps the lock with a single thread (or on a single CPU) gives much better cache-locality behavior and much better throughput numbers.

But an unfair lock that favors the local thread and the local CPU core can directly cause latency spikes when some other core wants the lock, even though keeping it core-local helps cache behavior. A fair lock, by contrast, avoids the latency spikes, but causes a lot of cross-CPU cache-coherency traffic, because the locked region now migrates much more aggressively from one CPU to another.
(Translator's note: these paragraphs repeat one idea: on modern CPU architectures, unfair locks are efficient locally (on the local CPU core) but can cause high latencies elsewhere in the system; fair locks are the opposite - latencies stay reasonable across the whole system, but overall throughput is correspondingly lower.)

In general, unfair locking can get so bad latency-wise that it is completely unacceptable on larger systems. On smaller systems the unfairness may not be as noticeable, while the performance advantage is, so system vendors will pick the unfair but faster lock-queueing algorithm.

(Pretty much every time we picked an unfair but fast locking model in the kernel, we ended up regretting it eventually and had to add fairness.)

So you might want to look not at the standard library implementation, but at specific locking implementations for your particular needs. Which is admittedly very, very annoying. But don't write your own. Find one that somebody else wrote and has spent decades actually tuning and making work.

Because you should never, ever think you are clever enough to write your own locking routines - because chances are you are not (and by "you" I very much include myself: we have been tweaking all the in-kernel locking for decades, going from simple test-and-set to ticket locks to cacheline-efficient queueing locks, and even people who know what they are doing tend to get it wrong several times).
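(Translator's note: the "ticket lock" mentioned here is roughly the idea below - a simplified sketch for reference, not the kernel's implementation. The tickets give FIFO fairness, but every waiter spins on the same now_serving counter, which is exactly the cache-line traffic that the later queueing locks were designed to avoid.)

```cpp
#include <atomic>

// Simplified ticket lock: fair (FIFO), but all waiters poll one shared counter.
struct TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};

    void lock() {
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my_ticket) { /* spin */ }
    }

    void unlock() {
        now_serving.fetch_add(1, std::memory_order_release);  // hand off to the next ticket
    }
};
```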

There is a reason why you can find decades of academic papers on locking. Really, it's hard.

Linus

Original text:
The whole post seems to be just wrong, and is measuring something completely different than what the author thinks and claims it is measuring.

First off, spinlocks can only be used if you actually know you’re not being scheduled while using them. But the blog post author seems to be implementing his own spinlocks in user space with no regard for whether the lock user might be scheduled or not. And the code used for the claimed “lock not held” timing is complete garbage.

It basically reads the time before releasing the lock, and then it reads it after acquiring the lock again, and claims that the time difference is the time when no lock was held. Which is just inane and pointless and completely wrong.

That’s pure garbage. What happens is that

(a) since you’re spinning, you’re using CPU time

(b) at a random time, the scheduler will schedule you out

(c) that random time might be just after you read the “current time”, but before you actually released the spinlock.

So now you still hold the lock, but you got scheduled away from the CPU, because you had used up your time slice. The “current time” you read is basically now stale, and has nothing to do with the (future) time when you are actually going to release the lock.

Somebody else comes in and wants that “spinlock”, and that somebody will now spin for a long while, since nobody is releasing it - it’s still held by that other thread entirely that was just scheduled out. At some point, the scheduler says “ok, now you’ve used your time slice”, and schedules the original thread, and now the lock is actually released. Then another thread comes in, gets the lock again, and then it looks at the time and says “oh, a long time passed without the lock being held at all”.

And notice how the above is the good scenario. If you have more threads than CPU’s (maybe because of other processes unrelated to your own test load), maybe the next thread that gets scheduled isn’t the one that is going to release the lock. No, that one already got its timeslice, so the next thread scheduled might be another thread that wants that lock that is still being held by the thread that isn’t even running right now!

So the code in question is pure garbage. You can’t do spinlocks like that. Or rather, you very much can do them like that, and when you do that you are measuring random latencies and getting nonsensical values, because what you are measuring is “I have a lot of busywork, where all the processes are CPU-bound, and I’m measuring random points of how long the scheduler kept the process in place”.

And then you write a blog-post blaming others, not understanding that it’s your incorrect code that is garbage, and is giving random garbage values.

And then you test different schedulers, and you get different random values that you think are interesting, because you think they show something cool about the schedulers.

But no. You’re just getting random values because different schedulers have different heuristics for “do I want to let CPU bound processes use long time slices or not”? Particularly in a load where everybody is just spinning on the silly and buggy benchmark, so they all look like they are pure throughput benchmarks and aren’t actually waiting on each other.

You might even see issues like “when I run this as a foreground UI process, I get different numbers than when I run it in the background as a batch process”. Cool interesting numbers, aren’t they?

No, they aren’t cool and interesting at all, you’ve just created a particularly bad random number generator.

So what’s the fix for this?

Use a lock where you tell the system that you’re waiting for the lock, and where the unlocking thread will let you know when it’s done, so that the scheduler can actually work with you, instead of (randomly) working against you.

Notice, how when the author uses an actual std::mutex, things just work fairly well, and regardless of scheduler. Because now you’re doing what you’re supposed to do. Yeah, the timing values might still be off - bad luck is bad luck - but at least now the scheduler is aware that you’re “spinning” on a lock.

Or, if you really want to use spinlocks (hint: you don’t), make sure that while you hold the lock, you’re not getting scheduled away. You need to use a realtime scheduler for that (or be the kernel: inside the kernel spinlocks are fine, because the kernel itself can say “hey, I’m doing a spinlock, you can’t schedule me right now”).

But if you use a realtime scheduler, you need to be aware of the other implications of that. There are many, and some of them are deadly. I would suggest strongly against trying. You’ll likely get all the other issues wrong anyway, and now some of the mistakes (like unfairness or priority inversions) can literally hang your whole thing entirely and things go from “slow because I did bad locking” to “not working at all, because I didn’t think through a lot of other things”.

Note that even OS kernels can have this issue - imagine what happens in virtualized environments with overcommitted physical CPU’s scheduled by a hypervisor as virtual CPU’s? Yeah - exactly. Don’t do that. Or at least be aware of it, and have some virtualization-aware paravirtualized spinlock so that you can tell the hypervisor that “hey, don’t do that to me right now, I’m in a critical region”.

Because otherwise you’re going to at some time be scheduled away while you’re holding the lock (perhaps after you’ve done all the work, and you’re just about to release it), and everybody else will be blocking on your incorrect locking while you’re scheduled away and not making any progress. All spinning on CPU’s.

Really, it’s that simple.

This has absolutely nothing to do with cache coherence latencies or anything like that. It has everything to do with badly implemented locking.

I repeat: do not use spinlocks in user space, unless you actually know what you’re doing. And be aware that the likelihood that you know what you are doing is basically nil.

There’s a very real reason why you need to use sleeping locks (like pthread_mutex etc).

In fact, I’d go even further: don’t ever make up your own locking routines. You will get them wrong, whether they are spinlocks or not. You’ll get memory ordering wrong, or you’ll get fairness wrong, or you’ll get issues like the above “busy-looping while somebody else has been scheduled out”.

And no, adding random “sched_yield()” calls while you’re spinning on the spinlock will not really help. It will easily result in scheduling storms while people are yielding to all the wrong processes.

Sadly, even the system locking isn’t necessarily wonderful. For a lot of benchmarks, for example, you want unfair locking, because it can improve throughput enormously. But that can cause bad latencies. And your standard system locking (eg pthread_mutex_lock()) may not have a flag to say “I care about fair locking because latency is more important than throughput”.

So even if you get locking technically right and are avoiding the outright bugs, you may get the wrong kind of lock behavior for your load. Throughput and latency really do tend to have very antagonistic tendencies wrt locking. An unfair lock that keeps the lock with one single thread (or keeps it to one single CPU) can give much better cache locality behavior, and much better throughput numbers.

But that unfair lock that prefers local threads and cores might thus directly result in latency spikes when some other core would really want to get the lock, but keeping it core-local helps cache behavior. In contrast, a fair lock avoids the latency spikes, but will cause a lot of cross-CPU cache coherency, because now the locked region will be much more aggressively moving from one CPU to another.

In general, unfair locking can get so bad latency-wise that it ends up being entirely unacceptable for larger systems. But for smaller systems the unfairness might not be as noticeable, but the performance advantage is noticeable, so then the system vendor will pick that unfair but faster lock queueing algorithm.

(Pretty much every time we picked an unfair - but fast - locking model in the kernel, we ended up regretting it eventually, and had to add fairness).

So you might want to look into not the standard library implementation, but specific locking implementations for your particular needs. Which is admittedly very very annoying indeed. But don’t write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.

Because you should never ever think that you’re clever enough to write your own locking routines… Because the likelihood is that you aren’t (and by that “you” I very much include myself - we’ve tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There’s a reason why you can find decades of academic papers on locking. Really. It’s hard.

Linus

Source: blog.csdn.net/nirendao/article/details/103900311