C/C++ Study Notes: Concurrency in Modern Hardware

1. Concurrency in modern hardware

1. What is concurrency?

function foo() { ... }
function bar() { ... }

function main() {
    t1 = startThread(foo)
    t2 = startThread(bar)
    // wait for t1 and t2 to finish before main() continues
    waitUntilFinished(t1)
    waitUntilFinished(t2)
}

        In this example program, concurrency means that foo() and bar() execute at the same time. How does the CPU actually do this?
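        In C++, the same structure can be written directly with std::thread. A minimal sketch (foo and bar are placeholders for whatever work the threads should do):

#include <thread>

void foo() { /* ... */ }
void bar() { /* ... */ }

int main() {
    std::thread t1(foo);  // start foo() on its own thread
    std::thread t2(bar);  // start bar() on its own thread
    // wait for t1 and t2 to finish before main() continues
    t1.join();
    t2.join();
}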

2. How to use concurrency to make your program faster?

        Modern CPUs can execute multiple instruction streams simultaneously:

        1. A single CPU core can execute multiple threads: Simultaneous Multithreading (SMT), which Intel calls hyperthreading

        2. Of course, the CPU can also have multiple cores that can run independently

        To get the best performance out of modern CPUs, writing multithreaded programs is essential. This requires a basic understanding of how the hardware behaves when running parallel programs.

        Most low-level implementation details can be found in the Intel Architecture Software Developer's Manual and the ARM Architecture Reference Manual.

3. Simultaneous Multithreading (SMT)

        CPUs support instruction-level parallelism by using out-of-order execution

        Using SMT (Simultaneous Multi-Threading), the CPU also supports thread-level parallelism

        1. A single CPU core executes multiple threads

        2. Many hardware components, such as ALU, SIMD units, etc., are shared between threads

        3. Other components are duplicated for each thread, such as the control unit (which fetches and decodes instructions) and the register file

4. Problems with SMT

        When using SMT, multiple instruction streams share parts of the same CPU core.

        1. SMT does not improve performance when a single stream already saturates the compute units on its own

        2. Both threads share the same memory bandwidth

        3. Some units may exist only once per core, so SMT can even degrade performance

        4. Running two threads from unrelated processes on the same core can also lead to security issues, similar to the Spectre and Meltdown vulnerabilities

5. Cache Coherence

        Different cores can access the same memory at the same time, multiple cores may share caches, and caches are typically inclusive

        The CPU must ensure that caches stay consistent under concurrent access; the cores communicate with each other through a cache coherence protocol

6. MESI protocol

        CPUs and caches always read and write memory at cache line granularity (typically 64 bytes)

        The generic MESI cache coherence protocol assigns each cache line one of four states:

                Modified: The cache line is only stored in one cache and has been modified in the cache, but has not been written back to main memory

                Exclusive: The cache line is stored in only one cache for exclusive use by one CPU and has not been modified

                Shared: The cache line is stored in at least one cache, is currently only used for read accesses, and has not been modified

                Invalid: The cache line is not loaded in this cache, or it is used exclusively by another cache

(1) MESI Example (1)

(2) MESI Example (2)

        (The two worked MESI state-transition examples are diagrams that are not reproduced in these notes.)

7. Memory access and concurrency

        Consider the following sample program, where foo() and bar() will execute simultaneously:

globalCounter = 0

function foo() {
    repeat 1000 times:
        globalCounter = globalCounter - 1
}

function bar() {
    repeat 1000 times:
        globalCounter = (globalCounter + 1) * 2
}

        In machine code, each of these updates decomposes into several instructions: load globalCounter into a register, modify the register, and store the register back to memory.

         What is the final value of globalCounter?
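        There is no single answer: the steps of the two threads can interleave arbitrarily between the load and the store, so updates can be lost. A minimal C++ sketch of the same race (illustrative; the unsynchronized access is formally a data race, and the printed value differs from run to run):

#include <iostream>
#include <thread>

int globalCounter = 0;  // deliberately not atomic: concurrent access is a data race

void foo() {
    for (int i = 0; i < 1000; ++i)
        globalCounter = globalCounter - 1;
}

void bar() {
    for (int i = 0; i < 1000; ++i)
        globalCounter = (globalCounter + 1) * 2;
}

int main() {
    std::thread t1(foo), t2(bar);
    t1.join();
    t2.join();
    std::cout << globalCounter << '\n';  // value is not deterministic
}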

8. Memory Order

        Out-of-order execution and simultaneous execution on multiple cores can cause memory load and store instructions to take effect in an unexpected order

        All executed instructions will eventually complete

        However, the effects of memory instructions (i.e. reads and writes) may become visible in an indeterminate order

        The CPU vendor defines how reads and writes are allowed to be interleaved: this is the architecture's memory order (memory model)

        In general, dependent instructions within a single thread always behave as expected:

store $123, A
load A, %r1

        If the memory location at A is only accessed by this thread, then r1 will always contain 123

(1) Weak and Strong Memory Order

        CPU architectures often have weak memory ordering (e.g. ARM) or strong memory ordering (e.g. x86)

        Weak memory order:

                Memory instructions and their effects can be reordered as long as dependencies are respected

                Different threads may see writes in different orders

        Strong memory order:

                Within a thread, only stores may be delayed past subsequent loads; everything else is not reordered

                When two threads store to the same location, all other threads see the resulting writes in the same order

                All other threads see the writes of any group of threads in the same order

        For both weak and strong memory order:

                Writes from other threads can be reordered

                Concurrent memory accesses to the same location can be reordered

(2) Example of Memory Order (1)

In this example, initially the memory at A contains the value 1 and the memory at B contains the value 2.
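        The original figure with the four threads is not reproduced here. Judging from the register values discussed below, the setup is presumably the following (an assumption): thread 1 stores 3 to A, thread 2 stores 4 to B, thread 3 loads A into r1 and then B into r2, and thread 4 loads B into r3 and then A into r4. A C++ sketch of that presumed setup, using relaxed atomics:

#include <atomic>
#include <thread>

std::atomic<int> A{1}, B{2};  // presumed initial values from the example
int r1, r2, r3, r4;

void thread1() { A.store(3, std::memory_order_relaxed); }
void thread2() { B.store(4, std::memory_order_relaxed); }
void thread3() { r1 = A.load(std::memory_order_relaxed);
                 r2 = B.load(std::memory_order_relaxed); }
void thread4() { r3 = B.load(std::memory_order_relaxed);
                 r4 = A.load(std::memory_order_relaxed); }

int main() {
    std::thread t1(thread1), t2(thread2), t3(thread3), t4(thread4);
    t1.join(); t2.join(); t3.join(); t4.join();
    // With weak (relaxed) ordering, r1 = 3, r2 = 2, r3 = 4, r4 = 1 is possible:
    // thread 3 observes the write to A but not the one to B, while thread 4
    // observes the write to B but not the one to A.
}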

Weak memory order:

        Threads have no dependent instructions

        Memory instructions can be arbitrarily reordered

        r1 = 3, r2 = 2, r3 = 4, r4 = 1 are allowed

Strong memory order:

        Threads 3 and 4 must see writes from threads 1 and 2 in the same order

        The outcome allowed under weak memory ordering above is not allowed here

        r1 = 3, r2 = 2, r3 = 4, r4 = 3 are allowed

(3) Example of Memory Order (2)

Visualization of the weak memory order example (the step numbers refer to the original diagram, which is not reproduced here):

        Thread 3 sees the write to A before the write to B (steps 4 and 1).
        Thread 4 sees the write to B before the write to A (steps 8 and 5).
        Under strong memory order, step 5 would not be allowed to happen before step 8.

9. Memory Barriers

        Multicore CPUs have special memory barrier (also known as memory fence) instructions that enforce stricter memory ordering requirements

        This is especially useful for architectures with weak memory ordering

        x86 has the following barrier instructions:

        lfence: earlier loads cannot be reordered after this instruction, and later loads and stores cannot be reordered before it

        sfence: earlier stores cannot be reordered after this instruction, and later stores cannot be reordered before it

        mfence: no loads or stores can be reordered across this instruction in either direction

        ARM has data memory barrier instructions that support different modes:

        dmb ishst: All writes that were visible in or caused by this thread before this instruction will be visible to all threads before any writes from stores that follow this instruction

        dmb ish: All writes visible in or caused by this thread and related reads preceding this instruction will be visible to all threads before any reads and writes following this instruction

        In order to additionally control out-of-order execution, ARM provides data synchronization barrier instructions: dsb ishst, dsb ish
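        In portable C++ code, such barriers are usually expressed with std::atomic_thread_fence instead of issuing the instructions directly (on x86 a sequentially consistent fence typically compiles to mfence, on ARM to a dmb variant). A minimal release/acquire sketch with illustrative names:

#include <atomic>

int data = 0;                    // payload written by the producer
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    data = 42;
    // Release fence: the write to data cannot be reordered after the
    // following store to ready.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    // Acquire fence: the read of data cannot be reordered before the
    // preceding load of ready.
    std::atomic_thread_fence(std::memory_order_acquire);
    // data is now guaranteed to contain 42.
}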

10. Atomic Operations

        Memory order only cares about memory loads and stores

        There are no memory order guarantees for concurrent stores to the same memory location; the resulting order may be undefined

        To allow deterministic concurrent modifications, most architectures support atomic operations

        An atomic operation is usually a sequence: load data, modify data, store data

        Also known as read-modify-write (RMW)

        The CPU ensures that all RMW operations are performed atomically, i.e. no other concurrent loads and stores are allowed in between

        Usually only single arithmetic and bitwise operations are supported as atomic RMW instructions
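        In C++, atomic RMW operations are exposed through std::atomic. A minimal sketch (this would, for example, make concurrent counter increments deterministic, unlike the globalCounter example above):

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 1000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // one atomic read-modify-write
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // counter is now exactly 2000: no increment is lost, because each
    // fetch_add performs its load, add, and store as a single atomic step.
}

        Note that an update like (globalCounter + 1) * 2 from the earlier example is not a single arithmetic operation and therefore cannot be expressed as one RMW instruction; this is where compare-and-swap comes in.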

11. Compare-And-Swap Operations (1)

        On x86, RMW instructions may lock the memory bus

        To avoid performance issues, there are only a few RMW instructions

        To facilitate more complex atomic operations, Compare-And-Swap (CAS) atomic operations can be used

        ARM does not support locking the memory bus, so all RMW operations are implemented with CAS

        A CAS instruction has three parameters: a memory location m, an expected value e, and a desired value d

        Conceptually, CAS operations work as follows:
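        The original pseudocode is not reproduced here; conceptually, a CAS on memory location m with expected value e and desired value d behaves like the following function, executed as one indivisible step (a sketch):

// Conceptual only: a real CAS is a single atomic instruction, not a function.
bool compareAndSwap(int& m, int e, int d) {
    if (m == e) {   // does the location still contain the value we expected?
        m = d;      // yes: install the desired value
        return true;
    }
    return false;   // no: another thread changed m; nothing is written
}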

        Note: CAS operations may fail, e.g. due to concurrent modifications

12. Compare-And-Swap Operations (2)

        Because CAS operations can fail, they are usually used in a loop with the following steps:

        1. Load the value from the memory location into a local register
        2. Compute using the local register, assuming no other thread modifies the memory location
        3. Generate the new desired value for the memory location
        4. Perform a CAS on the memory location, using the value in the local register as the expected value
        5. If the CAS operation fails, start the loop again from the beginning

        Note that steps 2 and 3 can contain any number of instructions and are not limited to RMW instructions!

13. Compare-And-Swap Operations (3)

        A typical loop using CAS looks like this:

success = false
while (not success) {                             // Step 5: retry until the CAS succeeds
    expected = load(A)                            // Step 1
    desired = non_trivial_operation(expected)     // Steps 2, 3
    success = CAS(A, expected, desired)           // Step 4
}
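        In C++, this loop is typically written with std::atomic::compare_exchange_weak, which on failure also reloads the current value into the expected variable. A minimal sketch (non_trivial_operation stands in for the actual computation):

#include <atomic>

int non_trivial_operation(int v) { return (v + 1) * 2; }  // placeholder computation

void atomicUpdate(std::atomic<int>& a) {
    int expected = a.load();                          // Step 1
    int desired = non_trivial_operation(expected);    // Steps 2, 3
    // Step 4: retry until the CAS succeeds; on failure, compare_exchange_weak
    // stores the current value back into expected, so we recompute (Step 5).
    while (!a.compare_exchange_weak(expected, desired)) {
        desired = non_trivial_operation(expected);
    }
}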

        Using this approach, arbitrarily complex atomic operations can be performed on memory locations

        However, the more time is spent on the non-trivial operation, the higher the probability that the CAS fails

        Also, the non-trivial operation may be executed more often than necessary (once per failed attempt)

2. Parallel programming

        Multithreaded programs often contain many shared resources
                Data structures
                Operating system handles (such as file descriptors)
                Individual memory locations

        Need to control concurrent access to shared resources
                Uncontrolled access leads to race conditions
                Race conditions often end in inconsistent program states
                Other results such as silent data corruption are also possible        

        Synchronization can be implemented in different ways
                Operating system support, for example through mutexes
                Hardware support, in particular through atomic operations

1. Mutual exclusion (1)

        Example: two threads concurrently remove elements from a linked list (the original diagram is not reproduced here)

        Observations
                One of the elements (C) is not actually removed, because the concurrent update overwrites the link to it
                It is also problematic if a thread frees a node's memory after removing it, since another thread may still be accessing that node

2. Mutual exclusion (2)

        Shared resources are protected by only allowing access within critical sections
                Only one thread at a time can enter a critical section
                If used correctly, this ensures that the program state is always consistent
                Non-deterministic (but consistent) program behavior is still possible

        There are multiple ways to implement mutual exclusion
                Atomic test-and-set operations
                        These usually require spinning, which can be dangerous (see the sections on spinning below)
                Operating system support
                        eg. mutexes on Linux

3. Lock

        Mutual exclusion is achieved by acquiring a lock on a mutex object
                Only one thread can hold the lock on a mutex at a time
                Attempting to lock an already locked mutex blocks the thread until the mutex becomes available again
                A blocked thread can be suspended by the kernel to free up computing resources

        Multiple mutexes can be used to represent separate critical sections
                Only one thread at a time can enter the same critical section, but threads can enter different critical sections at the same time
                This allows more fine-grained synchronization
                It requires careful implementation to avoid deadlocks

4. Shared lock

        Strict mutual exclusion is not always necessary
                Concurrent read-only accesses to the same shared resource do not interfere with each other
                Using strict mutexes can introduce unnecessary bottlenecks, because reads then also block each other
                We only need to ensure that write accesses are not concurrent with other write or read accesses

        Shared locks provide a solution (see the std::shared_mutex sketch below)
                A thread can acquire either an exclusive lock or a shared lock on a mutex
                If the mutex is not exclusively locked, multiple threads can hold a shared lock on it at the same time
                Only if the mutex is not locked in any mode (exclusive or shared) can a single thread acquire an exclusive lock on it
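        In C++ this corresponds to std::shared_mutex with std::shared_lock and std::unique_lock. A minimal sketch:

#include <mutex>
#include <shared_mutex>

std::shared_mutex rwMutex;
int sharedValue = 0;

int readValue() {
    std::shared_lock<std::shared_mutex> lock(rwMutex);  // shared: many readers at once
    return sharedValue;
}

void writeValue(int v) {
    std::unique_lock<std::shared_mutex> lock(rwMutex);  // exclusive: blocks readers and writers
    sharedValue = v;
}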

5. Mutual exclusion problem (1)

        Deadlock

        Multiple threads each wait for locks held by the other threads, so none of them can make progress

        Avoiding deadlocks (see the sketch below)
                If possible, threads should not acquire multiple locks at once
                If that cannot be avoided, locks must always be acquired in a globally consistent order
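        When a thread really must hold several locks, C++'s std::scoped_lock acquires them all using a built-in deadlock-avoidance algorithm, so callers do not have to agree on an order manually. A minimal sketch with illustrative names:

#include <mutex>

std::mutex mutexA, mutexB;  // two separate critical sections (illustrative)

void useBoth() {
    // Locks both mutexes without deadlocking, regardless of the order in
    // which other threads try to lock them.
    std::scoped_lock lock(mutexA, mutexB);
    // ... work on the data protected by mutexA and mutexB ...
}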

6. Mutual exclusion problem (2)


        Starvation
        High contention for a mutex can result in some threads never making progress
        This can be partially mitigated by using a less restrictive locking scheme

        High latency
        If contention on a mutex is intense, some threads will block for a long time,
        which can degrade overall system performance significantly,
        possibly even below single-threaded performance

        Priority inversion
        A high-priority thread may be blocked by a lower-priority thread that holds a lock,
        and because of the priority difference the low-priority thread may not get
        enough computing resources to release the lock quickly

7. Hardware-assisted synchronization

        Using mutexes is usually relatively expensive
               Each mutex requires some state (16 to 40 bytes)
                Acquiring a lock can require a system call, which can take thousands of cycles or more

        So mutexes are best suited for coarse-grained locking
                eg. locking an entire data structure instead of parts of it
                This is sufficient if only a few threads contend for the lock on the mutex
                The critical section protected by the mutex should do more work than the (potential) syscall needed to acquire the lock, otherwise the locking overhead dominates

        The performance of the mutex degrades rapidly under high contention.
                In particular, the latency of lock acquisition increases dramatically.
                This even happens when we only acquire shared locks on the mutex.
                We can take advantage of hardware support for more efficient synchronization.

8. Optimistic locking (1)

        In general, read-only access to resources is much more common than write access
                Therefore, we should optimize for the common case of read-only access
                In particular, parallel read-only access by many threads should be efficient
                Shared locks are not well suited for this

        Optimistic locking can provide efficient reader/writer synchronization
                A version is associated with the shared resource
                Writes must still acquire some kind of exclusive lock
                        This ensures that only one writer at a time can access the resource
                        At the end of its critical section, the writer atomically increments the version
                Reads merely validate the version
                        At the beginning of its critical section, the reader atomically reads the current version
                        At the end of its critical section, the reader verifies that the version has not changed
                        Otherwise a concurrent write has occurred and the reader restarts its critical section

9. Optimistic locking (2)

        Example (pseudocode)

writer(optLock) {
    lockExclusive(optLock.mutex) // begin critical section
    // modify the shared resource
    storeAtomic(optLock.version, optLock.version + 1)
    unlockExclusive(optLock.mutex) // end critical section
}

reader(optLock) {
    while(true) {
        current = loadAtomic(optLock.version); // begin critical section
        // read the shared resource
        if (current == loadAtomic(optLock.version)) // validate
            return; // end critical section
    }
}

10. Optimistic locking (3)

        Why does optimistic locking work?
                A read only needs to execute two atomic load instructions
                This is much cheaper than acquiring a shared lock
                It requires that modifications are rare, though, otherwise reads have to be restarted frequently

        Readers must be careful
                The shared resource may be modified while a reader is accessing it
                We cannot assume that we are reading from a consistent state
                More complex read operations may require additional intermediate validation

11. Beyond mutual exclusion

        In many cases, strict mutual exclusion is not needed in the first place
                eg. Parallel insertion into linked list
                we don't care about the order of insertion
                we just need to ensure that all insertions are reflected in the final state

        This can be efficiently achieved by using atomic operations (pseudocode)

threadSafePush(linkedList, element) {
    while (true) {
        head = loadAtomic(linkedList.head)
        element.next = head
        if (CAS(linkedList.head, head, element))
            break;
    }
}

12. Non-blocking algorithm

        Algorithms or data structures that do not rely on locks are called non-blocking
                eg. the threadSafePush function above
                Synchronization between threads is typically implemented with atomic operations
                This enables more efficient implementations of many common algorithms and data structures

        Such algorithms can provide different levels of progress guarantees
                Wait-free: there is an upper bound on the number of steps required to complete each operation
                        This is hard to achieve in practice
                Lock-free: if the program runs long enough, at least one thread makes progress
                        Often used informally (and technically incorrectly) as a synonym for non-blocking

13. ABA problem (1)

        Non-blocking data structures require careful implementation
                We no longer have the luxury of critical sections
                Threads can perform different operations on the data structure in parallel (such as inserts and deletes)
                Composite operations that consist of several atomic operations can be interleaved arbitrarily
                This can lead to hard-to-debug anomalies such as lost updates or the ABA problem

        Problems can often be avoided by ensuring that only identical operations (such as inserts) are performed in parallel

                E.g. Insert elements in parallel in the first step and delete them in parallel in the second step

14. ABA problem (2)

        Consider the following simple linked list based stack (pseudocode)

threadSafePush(stack, element) {
    while (true) {
        head = loadAtomic(stack.head)
        element.next = head
        if (CAS(stack.head, head, element))
            break;
    }
}

threadSafePop(stack) {
    while (true) {
        head = loadAtomic(stack.head)
        next = head.next
        if (CAS(stack.head, head, next))
            return head
    }
}

15. ABA problem (3)

        Consider an initial state of the stack on which two threads perform operations in parallel (the original diagram is not reproduced here)

        Our implementation allows the following interleaving
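        One interleaving of this kind could look as follows (assuming the stack initially contains head → A → B → C):

        1. Thread 1 starts threadSafePop: it loads head = A and next = B, then is suspended before its CAS
        2. Thread 2 pops A, then pops B, then pushes A back; the stack is now head → A → C, and B has been removed (its memory may already be freed or reused)
        3. Thread 1 resumes; its CAS(stack.head, A, B) succeeds because head again contains A, even though the stack changed in between
        4. The stack head now points to B, an element that was already removed: the value A alone cannot tell the CAS that anything happened in between (the A-B-A problem)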

16. The danger of spinning (1)

        A "better" mutex can be implemented that requires less space and does not use
                system calls using atomic operations:
                the mutex is represented by a single atomic integer
                0 when unlocked and 1 when locked
                to lock the mutex , change it to 1 only if the value is atomically changed to 0 using CAS
                CAS repeats as long as another thread holds the mutex

function lock(mutexAddress) {
    while (CAS(mutexAddress, 0, 1) not successful) {
        <noop>
    }
}
function unlock(mutexAddress) {
    atomicStore(mutexAddress, 0)
}

17. The danger of spinning (2)

        Using this CAS loop as a mutex, also known as a spin lock, has several disadvantages:
        1. It has no fairness, i.e. there is no guarantee that the thread will eventually acquire the lock

        2. The CAS loop consumes CPU cycles while waiting (wasting energy and resources)

        3. It is easy to cause priority inversion

                The operating system's scheduler thinks that spinning threads require a lot of CPU time

                Spinning threads don't actually do any useful work at all

                In the worst case, the scheduler takes the CPU time of the thread holding the lock to give it to the spinning thread

        4. The spinning then takes even longer, which makes the situation worse

        Possible solutions:
                Spin only for a finite number of iterations
                Fall back to a "real" mutual exclusion lock if the lock cannot be acquired by spinning
                In fact, this is how mutex locks are usually implemented (for example Java's biased, lightweight, and heavyweight locks; biased locking appears to have been removed in recent versions)

        Below is a complete implementation of a basic spinlock using C++11 atomics

struct spinlock {
  std::atomic<bool> lock_ = {0};

  void lock() noexcept {
    for (;;) {
      // Optimistically assume the lock is free on the first try
      if (!lock_.exchange(true, std::memory_order_acquire)) {
        return;
      }
      // Wait for the lock to be released without generating cache misses
      while (lock_.load(std::memory_order_relaxed)) {
        // Issue an x86 PAUSE or ARM YIELD instruction to reduce contention between hyper-threads
        __builtin_ia32_pause();
      }
    }
  }

  bool try_lock() noexcept {
    // First do a relaxed load to check whether the lock is free, to avoid
    // unnecessary cache misses when someone spins with while(!try_lock())
    return !lock_.load(std::memory_order_relaxed) &&
           !lock_.exchange(true, std::memory_order_acquire);
  }

  void unlock() noexcept {
    lock_.store(false, std::memory_order_release);
  }
};

        The purpose of a spinlock is to prevent multiple threads from accessing a shared data structure at the same time. In contrast to a mutex, the thread will be busy waiting and wasting CPU cycles instead of yielding the CPU to another thread. Unless you're sure you understand the consequences, don't use custom spinlocks but use atomic variables provided in various languages.
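        Since the struct provides lock(), try_lock(), and unlock(), it satisfies the standard Lockable requirements and works with the usual RAII guards. A short usage sketch with the spinlock defined above:

#include <mutex>

spinlock spin;             // the spinlock struct defined above
int protectedCounter = 0;

void increment() {
    std::lock_guard<spinlock> guard(spin);  // locks on entry, unlocks when the scope ends
    ++protectedCounter;
}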


Origin blog.csdn.net/bashendixie5/article/details/127187791