1. Concurrency in modern hardware
1. What is concurrency?
function foo() { ... }
function bar() { ... }
function main() {
t1 = startThread(foo)
t2 = startThread(bar)
// wait for t1 and t2 to finish before main() continues
waitUntilFinished(t1)
waitUntilFinished(t2)
}
In this example program, concurrency means that foo() and bar() execute at the same time. How does the CPU actually do this?
2. How to use concurrency to make your program faster?
Modern CPUs can execute multiple instruction streams simultaneously:
1. A single CPU core can execute multiple threads: Simultaneous Multithreading (SMT), which Intel calls hyperthreading
2. Of course, the CPU can also have multiple cores that can run independently
To get the best performance in programming, writing multithreaded programs is essential. To do this, a basic understanding of how hardware behaves in a parallel programming environment is required.
Most low-level implementation details can be found in the Intel Architecture Software Developer's Manual and the ARM Architecture Reference Manual.
3. Simultaneous Multithreading (SMT)
CPUs support instruction-level parallelism by using out-of-order execution
Using SMT (Simultaneous Multi-Threading), the CPU also supports thread-level parallelism
1. A single CPU core executes multiple threads
2. Many hardware components, such as the ALU and SIMD units, are shared between the threads
3. Other components are duplicated per thread, such as the control unit (instruction fetch and decode) and the register file
4. Problems with SMT
When using SMT, multiple instruction streams share components of the same CPU core:
1. SMT does not improve performance when a single stream already keeps all compute units busy
2. Both threads share the same memory bandwidth
3. Some units may exist only once per core, so SMT can even degrade performance
4. This can lead to security issues when two threads from unrelated processes run on the same core! Similar to Spectre and Meltdown security concerns.
5. Cache Coherence
Different cores can access the same memory simultaneously, multiple cores may share caches, and caches may be inclusive
The CPU must keep the caches consistent under concurrent access! CPUs communicate through a cache coherence protocol
6. MESI protocol
CPUs and caches always read and write at cache-line granularity (typically 64 bytes)
The generic MESI cache coherence protocol assigns each cache line one of four states:
Modified: The cache line is only stored in one cache and has been modified in the cache, but has not been written back to main memory
Exclusive: Cache lines are only stored in one cache for exclusive use by one CPU
Shared: The cache line is stored in at least one cache, is currently only read by CPUs, and has not been modified
Invalid: The cache line is not loaded in this cache or is used exclusively by another cache
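The state transitions can be sketched in code. This is purely a hypothetical illustration (the enum and function names are ours, not hardware interfaces), showing two representative transitions of one cache line in one cache: a write by the local core and a read by a remote core.

```cpp
// Hypothetical sketch of MESI state transitions, not a hardware model.
enum class MesiState { Modified, Exclusive, Shared, Invalid };

// This core writes to the line: it must gain exclusive ownership, so the
// line becomes Modified (copies in other caches would be invalidated).
MesiState onLocalWrite(MesiState s) {
    (void)s;  // any starting state transitions to Modified on a local write
    return MesiState::Modified;
}

// Another core reads the line: a Modified line is written back and
// downgraded to Shared; an Exclusive line also degrades to Shared.
MesiState onRemoteRead(MesiState s) {
    if (s == MesiState::Modified || s == MesiState::Exclusive)
        return MesiState::Shared;
    return s;  // Shared and Invalid are unaffected by a remote read
}
```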
(1) MESI Example (1)
(2) MESI Example (2)
7. Memory access and concurrency
Consider the following sample program, where foo() and bar() will execute simultaneously:
globalCounter = 0
function foo() {
repeat 1000 times:
globalCounter = globalCounter - 1
}
function bar() {
repeat 1000 times:
globalCounter = (globalCounter + 1) * 2
}
The machine code for this program might look like this:
What is the final value of globalCounter?
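With plain, non-atomic updates the two read-modify-write sequences can interleave arbitrarily, so the final value is non-deterministic (in C++ such a data race is even undefined behavior). As a sketch (names are ours, not from the text), the C++ version below shows that simple updates become deterministic once each one is a single atomic RMW operation; note that bar's compound update (x + 1) * 2 cannot be expressed as one RMW instruction and needs the CAS loop shown later.

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: an atomic counter makes simple concurrent updates
// deterministic. With a plain int instead, this would be a data race and
// the final value would be undefined.
std::atomic<int> counter{0};

int runDemo() {
    // One thread decrements 1000 times, one increments 1000 times,
    // each step as a single atomic read-modify-write.
    std::thread dec([] { for (int i = 0; i < 1000; ++i) counter.fetch_sub(1); });
    std::thread inc([] { for (int i = 0; i < 1000; ++i) counter.fetch_add(1); });
    dec.join();
    inc.join();
    return counter.load();  // increments and decrements cancel: always 0
}
```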
8. Memory Order
Out-of-order execution and simultaneous multithreading can lead to unexpected ordering of memory load and store instructions
All executed instructions will eventually complete
However, the effects of memory instructions (i.e. reads and writes) may become visible in an indeterminate order
The CPU vendor defines how reads and writes may be interleaved: the memory order
In general: dependent instructions within a single thread always behave as expected:
store $123, A
load A, %r1
If the memory location A is accessed only by this thread, %r1 will always contain 123
(1) Weak and Strong Memory Order
CPU architectures often have weak memory ordering (e.g. ARM) or strong memory ordering (e.g. x86)
Weak memory order:
Memory instructions and their effects can be reordered as long as dependencies are respected
Different threads may see writes in different orders
Strong memory order:
Within a thread, only stores may be delayed past subsequent loads (store buffering); everything else is not reordered
When two threads store to the same location, all other threads see the resulting writes in the same order
All other threads see the writes of a set of threads in the same order
For Both:
Writes from other threads can be reordered
Concurrent memory accesses to the same location can be reordered
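The classic store-buffering litmus test illustrates the difference. This C++ sketch (our example, not from the text) uses sequentially consistent accesses, under which both threads reading 0 is forbidden; with memory_order_relaxed, or plain stores on weakly ordered hardware, that outcome becomes possible.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test. Under seq_cst, each thread's store becomes
// visible before the other thread's load, so r1 and r2 can never both be 0.
bool bothZeroPossible() {
    std::atomic<int> A{0}, B{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] { A.store(1, std::memory_order_seq_cst);
                         r1 = B.load(std::memory_order_seq_cst); });
    std::thread t2([&] { B.store(1, std::memory_order_seq_cst);
                         r2 = A.load(std::memory_order_seq_cst); });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;  // guaranteed false under seq_cst
}
```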
(2) Example of Memory Order (1)
In this example, initially the memory at A contains the value 1 and the memory at B contains the value 2.
Weak memory order:
Threads have no dependent instructions
Memory instructions can be arbitrarily reordered
r1 = 3, r2 = 2, r3 = 4, r4 = 1 are allowed
Strong memory order:
Threads 3 and 4 must see the writes of threads 1 and 2 in the same order
The outcome allowed under weak memory order above is therefore not allowed here
r1 = 3, r2 = 2, r3 = 4, r4 = 3 is allowed
(3) Example of Memory Order (2)
Visualization of weak memory order example:
Thread 3 sees write A before write B (steps 4. and 1.)
Thread 4 sees write B before write A (steps 8. and 5.)
Under strong memory order, 5. is not allowed to happen before 8.
9. Memory Barriers
Multicore CPUs have special memory barrier (also known as memory fence) instructions that enforce stricter memory ordering requirements
This is especially useful for architectures with weak memory ordering
x86 has the following barrier instructions:
lfence: earlier loads cannot be reordered past this instruction, and later loads and stores cannot be reordered before it
sfence: earlier stores cannot be reordered past this instruction, and later stores cannot be reordered before it
mfence: no loads or stores can be reordered past or before this instruction
ARM has data memory barrier instructions that support different modes:
dmb ishst: All writes that were visible in or caused by this thread before this instruction will be visible to all threads before any writes from stores that follow this instruction
dmb ish: All writes visible in or caused by this thread and related reads preceding this instruction will be visible to all threads before any reads and writes following this instruction
In order to additionally control out-of-order execution, ARM provides data synchronization barrier instructions: dsb ishst, dsb ish
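In portable code these barriers are usually not issued directly; C++ exposes them as std::atomic_thread_fence. The sketch below (our example) shows the message-passing pattern: the release fence keeps the data store before the flag store, and the acquire fence keeps the data load after the flag load. The analogies to sfence/lfence and dmb in the comments are rough, not exact.

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                            // plain store
    std::atomic_thread_fence(std::memory_order_release);  // roughly: sfence / dmb ishst
    ready.store(true, std::memory_order_relaxed);
}

int consume() {
    while (!ready.load(std::memory_order_relaxed)) { }    // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);  // roughly: lfence / dmb ish
    return data;                                          // guaranteed to observe 42
}

int runFenceDemo() {
    std::thread t(producer);
    int v = consume();
    t.join();
    return v;
}
```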
10. Atomic Operations
Memory order only cares about memory loads and stores
Memory order imposes no restriction on concurrent stores to the same memory location: their order may be undefined
To allow deterministic concurrent modifications, most architectures support atomic operations
An atomic operation is usually a sequence: load data, modify data, store data
Also known as read-modify-write (RMW)
The CPU ensures that all RMW operations are performed atomically, i.e. no other concurrent loads and stores are allowed in between
Usually only single arithmetic and bitwise operations are supported
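A small C++ sketch (our example) of such RMW operations: fetch_add and fetch_or each perform the whole load-modify-store atomically and return the previous value.

```cpp
#include <atomic>

// fetch_add is a single RMW: it loads, adds, and stores atomically;
// no other thread's access can slip in between.
int rmwDemo() {
    std::atomic<int> x{5};
    int previous = x.fetch_add(3);  // atomically x = x + 3, returns old value (5)
    int bits = x.fetch_or(0x10);    // bitwise RMW, returns old value (8)
    return previous + bits;         // 5 + 8 = 13
}
```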
11. Compare-And-Swap Operations (1)
On x86, the RMW instruction may lock the memory bus
To avoid performance issues, there are only a few RMW instructions
To facilitate more complex atomic operations, Compare-And-Swap (CAS) atomic operations can be used
ARM does not support locking the memory bus, so all RMW operations are implemented with CAS
A CAS instruction has three parameters: a memory location m, an expected value e, and a desired value d
Conceptually, CAS operations work as follows:
Note: CAS operations may fail, e.g. due to concurrent modifications
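The conceptual behavior can be sketched with C++'s compare_exchange_strong (our example, not from the text): on success the desired value is stored; on failure the current value is written back into the expected argument, which is exactly what the retry loop in the next section relies on.

```cpp
#include <atomic>

// CAS semantics: if m == expected, store desired and succeed;
// otherwise fail and copy the current value of m into expected.
bool casDemo() {
    std::atomic<int> m{10};
    int expected = 10;
    bool ok1 = m.compare_exchange_strong(expected, 20);  // succeeds, m = 20
    expected = 10;                                       // stale expectation
    bool ok2 = m.compare_exchange_strong(expected, 30);  // fails, expected = 20
    return ok1 && !ok2 && expected == 20 && m.load() == 20;
}
```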
12. Compare-And-Swap Operations (2)
Because CAS operations can fail, they are usually used in a loop with the following steps:
1. Load the value from the memory location into a local register
2. Perform the computation using the local register, assuming no other thread modifies the memory location
3. Generate the new desired value for the memory location
4. Perform CAS on the memory location, using the value in the local register as the expected value
5. If the CAS operation fails, restart the loop from the beginning
Note that steps 2 and 3 can contain any number of instructions and are not limited to RMW instructions!
13. Compare-And-Swap Operations (3)
A typical loop using CAS looks like this:
success = false
while (not success) { (Step 5)
expected = load(A) (Step 1)
desired = non_trivial_operation(expected) (Steps 2, 3)
success = CAS(A, expected, desired) (Step 4)
}
Using this approach, arbitrarily complex atomic operations can be performed on memory locations
However, the more time is spent in the non-trivial operation, the higher the probability that the CAS fails
Also, the non-trivial operation may be executed more often than necessary
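For instance, an atomic maximum, which no single RMW instruction provides, can be built from this loop; a C++ sketch (our example):

```cpp
#include <atomic>

// Atomically set x to max(x, v) using the CAS loop pattern.
void atomicMax(std::atomic<int>& x, int v) {
    int expected = x.load();            // Step 1
    // Steps 2-4: only attempt the CAS while v would actually raise x.
    // Step 5: on failure, compare_exchange_weak reloads the current
    // value into `expected`, and the loop retries.
    while (expected < v &&
           !x.compare_exchange_weak(expected, v)) {
    }
}
```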
2. Parallel programming
Multithreaded programs often contain many shared resources:
data structures
operating system handles (such as file descriptors)
plain memory locations
Concurrent access to shared resources must be controlled:
Uncontrolled access leads to race conditions
Race conditions often result in inconsistent program state
Other consequences, such as silent data corruption, are also possible
Synchronization can be implemented in different ways:
Operating system support, e.g., through mutexes
Hardware support, especially through atomic operations
1. Mutual exclusion (1)
Example: two threads remove elements from a linked list at the same time
Observations:
Node C may not actually be deleted
It is also possible that a thread frees a node's memory after deletion while another thread still accesses it
2. Mutual exclusion (2)
Protecting shared resources by only allowing access within critical sections.
Only one thread at a time can enter the critical section.
If used correctly, this ensures that the program state is always consistent
Non-deterministic (but consistent) program behavior is still possible
There are multiple possibilities for implementing mutual exclusion:
Atomic test-and-set operations
often require spinning, which can be dangerous
Operating system support
e.g., mutexes in Linux
3. Lock
Mutual exclusion is achieved by acquiring a lock on the mutex object
Only one thread at a time can hold the lock on a mutex
Attempting to lock an already locked mutex blocks the thread until the mutex becomes available again
Blocked threads can be suspended by the kernel to free up computing resources
Multiple mutexes can be used to represent separate critical sections
Only one thread can enter the same critical section at a time, but threads can enter different critical sections at the same time
Allows for more fine-grained synchronization
Needs careful implementation to avoid deadlocks
4. Shared lock
Strict mutexes are not always necessary
Concurrent read-only accesses to the same shared resource do not interfere with each other
Using a strict mutex introduces an unnecessary bottleneck because reads also block each other
We only need to ensure that write accesses do not run concurrently with other write or read accesses
Shared locks provide a solution:
A thread can acquire either an exclusive lock or a shared lock on a mutex
If the mutex is not locked exclusively, multiple threads can hold a shared lock on it at the same time
If the mutex is not locked in any mode (exclusive or shared), a single thread can acquire an exclusive lock on it
5. Mutual exclusion problem (1)
Deadlock
Multiple threads each wait for locks that other threads hold
Avoiding deadlocks:
If possible, threads should not acquire multiple locks
If that cannot be avoided, locks must always be acquired in a globally consistent order
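In C++, this can also be delegated to the library: std::scoped_lock acquires multiple mutexes with a built-in deadlock-avoidance algorithm, so two threads locking the same pair of mutexes in different orders cannot deadlock. A sketch (our example, not from the text):

```cpp
#include <mutex>

std::mutex accountA, accountB;
int balanceA = 100, balanceB = 0;

void transfer(int amount) {
    // Acquires both mutexes atomically with respect to deadlock:
    // either both are obtained or the attempt is safely retried.
    std::scoped_lock lock(accountA, accountB);
    balanceA -= amount;
    balanceB += amount;
}
```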
6. Mutual exclusion problem (2)
Starvation
High contention on a mutex can result in some threads never making progress
This can be partially mitigated by a less restrictive locking scheme
High latency
If contention on a mutex is high, some threads will block for a long time
This can cause significant degradation of system performance, possibly even below single-threaded performance
Priority inversion
A high-priority thread can be blocked by a low-priority thread holding the lock
Due to the priority difference, the low-priority thread may not get enough computing resources to release the lock quickly
7. Hardware-assisted synchronization
Using mutexes is usually relatively expensive
Each mutex requires some state (16 to 40 bytes)
Acquiring a lock can require a system call, which can take thousands of cycles or more
So mutexes are best suited for coarse-grained locking
e.g., locking an entire data structure instead of parts of it
This is sufficient if only a few threads contend for the lock
and if the critical section protected by the mutex is more expensive than the (potential) syscall needed to acquire the lock
The performance of the mutex degrades rapidly under high contention.
In particular, the latency of lock acquisition increases dramatically.
This even happens when we only acquire shared locks on the mutex.
We can take advantage of hardware support for more efficient synchronization.
8. Optimistic locking (1)
In general, read-only access to resources is more common than write access.
Therefore, we should optimize for the common case of read-only access.
In particular, parallel read-only access by many threads should be efficient
Shared locks are not well suited for this
Optimistic locking can provide efficient reader/writer synchronization
A version counter is associated with the shared resource
Writes must still acquire some kind of exclusive lock
This ensures that only one writer at a time can access the resource
At the end of its critical section, the writer atomically increments the version
Reads simply validate the version:
At the beginning of its critical section, a read atomically loads the current version
At the end of its critical section, the read verifies that the version has not changed
Otherwise a concurrent write occurred and the critical section is restarted
9. Optimistic locking (2)
Example (pseudocode)
writer(optLock) {
lockExclusive(optLock.mutex) // begin critical section
// modify the shared resource
storeAtomic(optLock.version, optLock.version + 1)
unlockExclusive(optLock.mutex) // end critical section
}
reader(optLock) {
while(true) {
current = loadAtomic(optLock.version); // begin critical section
// read the shared resource
if (current == loadAtomic(optLock.version)) // validate
return; // end critical section
}
}
10. Optimistic locking (3)
Why does optimistic locking work?
A read only needs to execute two atomic load instructions
This is much cheaper than acquiring a shared lock
However, it requires that modifications are rare, otherwise reads have to restart frequently
Caveat for readers:
The shared resource may be modified while a reader is accessing it
We cannot assume that we read a consistent state
More complex read operations may require additional intermediate validation
11. Beyond mutual exclusion
In many cases, strict mutual exclusion is not needed in the first place
e.g., parallel insertion into a linked list:
We do not care about the order of insertions
We just need to ensure that all insertions are reflected in the final state
This can be efficiently achieved by using atomic operations (pseudocode)
threadSafePush(linkedList, element) {
while (true) {
head = loadAtomic(linkedList.head)
element.next = head
if (CAS(linkedList.head, head, element))
break;
}
}
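The pseudocode above translates almost directly to C++ atomics; a sketch (our example, with a single int payload for brevity):

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> listHead{nullptr};

// Lock-free push onto a singly linked list.
void threadSafePush(Node* element) {
    Node* head = listHead.load();
    do {
        element->next = head;
        // On failure, compare_exchange_weak reloads the current
        // head into `head`, and we retry with the new snapshot.
    } while (!listHead.compare_exchange_weak(head, element));
}
```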
12. Non-blocking algorithm
Algorithms or data structures that do not rely on locks are called non-blocking
eg. The threadSafePush function above
Synchronization between threads is usually implemented with atomic operations
This enables more efficient implementations of many common algorithms and data structures
Such algorithms can provide different levels of progress guarantees
Wait-free: there is an upper bound on the number of steps required to complete each operation
This is difficult to achieve in practice
Lock-free: if the program runs long enough, at least one thread makes progress
Often used informally (and technically incorrectly) as a synonym for non-blocking
13. ABA problem (1)
Non-blocking data structures require careful implementation.
We no longer have the luxury of critical sections
Threads can perform different composite operations on the data structure in parallel (such as inserts and deletes)
The individual atomic operations that make up these composite operations can be arbitrarily interleaved
This can lead to hard-to-debug anomalies such as lost updates or the ABA problem
Problems can often be avoided by ensuring that only identical operations (such as inserts) run in parallel
e.g., insert elements in parallel in a first step and delete them in parallel in a second step
14. ABA problem (2)
Consider the following simple linked list based stack (pseudocode)
threadSafePush(stack, element) {
while (true) {
head = loadAtomic(stack.head)
element.next = head
if (CAS(stack.head, head, element))
break;
}
}
threadSafePop(stack) {
while (true) {
head = loadAtomic(stack.head)
next = head.next
if (CAS(stack.head, head, next))
return head
}
}
15. ABA problem (3)
Consider the following initial state of the stack on which two threads perform some operations in parallel
Our implementation would allow the operations to interleave as follows
16. The danger of spin (1)
A seemingly "better" mutex that needs less space and no system calls can be implemented using atomic operations:
The mutex is represented by a single atomic integer
It is 0 when unlocked and 1 when locked
To lock the mutex, atomically change its value from 0 to 1 using CAS
Repeat the CAS as long as another thread holds the mutex
function lock(mutexAddress) {
while (CAS(mutexAddress, 0, 1) not successful) {
<noop>
}
}
function unlock(mutexAddress) {
atomicStore(mutexAddress, 0)
}
17. The danger of spin (2)
Using this CAS loop as a mutex, also known as a spin lock, has several disadvantages:
1. It has no fairness, i.e. there is no guarantee that the thread will eventually acquire the lock
2. CAS cycle consumes CPU cycles (wasting energy and resources)
3. It is easy to cause priority inversion
The operating system's scheduler thinks that spinning threads require a lot of CPU time
Spinning threads don't actually do any useful work at all
In the worst case, the scheduler takes the CPU time of the thread holding the lock to give it to the spinning thread
4. The longer threads spin, the worse these problems become
Possible solution:
Spin only a finite number of times (e.g., a few iterations)
Fall back to a "true" mutex if the lock cannot be acquired
This is in fact how mutexes are commonly implemented (e.g., Java's biased, lightweight, and heavyweight locks; biased locking appears to have been removed in recent versions)
Below is a complete implementation of a basic spinlock using C++11 atomics
struct spinlock {
std::atomic<bool> lock_ = {false};
void lock() noexcept {
for (;;) {
// Optimistically assume the lock is free on the first try
if (!lock_.exchange(true, std::memory_order_acquire)) {
return;
}
// Wait for the lock to be released without generating cache misses
while (lock_.load(std::memory_order_relaxed)) {
// Issue an x86 PAUSE or ARM YIELD instruction to reduce contention between hyper-threads
__builtin_ia32_pause();
}
}
}
bool try_lock() noexcept {
// First do a relaxed load to check if the lock is free, to prevent unnecessary cache misses when someone spins in while (!try_lock())
return !lock_.load(std::memory_order_relaxed) &&
!lock_.exchange(true, std::memory_order_acquire);
}
void unlock() noexcept {
lock_.store(false, std::memory_order_release);
}
};
The purpose of a spinlock is to prevent multiple threads from accessing a shared data structure at the same time. In contrast to a mutex, waiting threads busy-wait and waste CPU cycles instead of yielding the CPU to another thread. Unless you are sure you understand the consequences, don't write custom spinlocks; use the atomic variables and locking primitives provided by your language instead.