Basic knowledge of parallel multi-core architecture

Classification

According to Flynn's taxonomy

Flynn's taxonomy classifies parallel computers based on the number of instruction streams and data streams.

| | One data stream | Many data streams |
| --- | --- | --- |
| One instruction stream | SISD | SIMD |
| Many instruction streams | MISD | MIMD |

SISD: Single instruction stream and single data stream. The traditional sequential architecture.

SIMD: Single instruction stream and multiple data streams. Found in GPUs and in vector instruction set extensions such as Intel MMX/SSE and AMD 3DNow!, etc.

MISD: Multiple instruction streams and single data stream. Difficult to apply in practice, e.g., iWarp.

MIMD: Multiple instruction streams and multiple data streams.

MIMD computer classification

Based on the number of processors, shared memory multiprocessors are divided into two categories: SMP/UMA and DSM/NUMA.

SMP/UMA: Shared memory multiprocessor, also called centralized shared memory multiprocessor or uniform memory access multiprocessor (generally the number of cores does not exceed 8). The processors share the cache and main memory.

DSM/NUMA: Multiprocessors that use physically distributed memory are called distributed shared memory (DSM) multiprocessors; because memory access latency depends on where the memory is located, they are also called non-uniform memory access (NUMA) machines.

Parallel programming

Fundamentally, the point of executing an algorithm or code in parallel is to obtain a shorter execution time than its serial counterpart. A useful tool for analyzing parallel program execution time is Amdahl's law.
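As a quick refresher (the numerical example here is ours): if a fraction P of a program's execution time can be parallelized across N processors while the remaining 1 − P stays serial, Amdahl's law bounds the overall speedup at

speedup = 1 / ((1 − P) + P / N)

For example, with P = 0.9 and N = 16, the speedup is at most 1 / (0.1 + 0.9/16) ≈ 6.4; even with infinitely many processors it can never exceed 1 / (1 − P) = 10.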

Parallel programming model

Two main models: the shared memory model and the message passing model.

Shared memory model: all threads (or parallel tasks) can access any location in memory, and they communicate implicitly by writing to and reading from memory locations (similar to multiple threads of the same process sharing an address space).

Message passing model: each thread has its own local memory, and one thread cannot access the memory of another thread. To exchange data, threads must communicate explicitly by passing messages containing the data values (similar to multiple processes that do not share an address space).
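To make the contrast concrete, here is a minimal message passing sketch using MPI (our choice of illustration; the original text names no specific library): two processes with separate address spaces exchange a value only through explicit send/receive calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        // Explicit communication: send the value to process 1
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Explicit communication: receive the value from process 0
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}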

Model comparison

| | Shared memory model | Message passing model |
| --- | --- | --- |
| Communication | implicit | explicit messages |
| Synchronization | explicit | implicit (via messages) |
| Hardware support | usually required | not required |
| Programming effort | lower | higher |
| Tuning effort | higher | lower |
| Communication granularity | finer | coarser |

Shared memory generally requires specialized hardware support. On a multi-core processor, the last-level cache may already be shared among the cores. In the multi-node case, however, each node has its own processors and memory, and the nodes are interconnected to form a shared memory system; hardware support is then needed to create the illusion of a single memory, i.e., the memories of all nodes together form one memory that every processor can address.
When the number of processors is large, it becomes difficult to implement the shared memory abstraction at low cost.

Other programming models:

  • Partitioned global address space (PGAS): allows all threads to transparently share a single address space
  • Data parallel programming model: similar to SIMD
  • MapReduce: a model for cluster-scale computing
  • Transactional memory (TM): lets the programmer define a block of code as a transaction

Shared memory parallel model

Take OpenMP as an example: it is an application programming interface that supports shared memory programming and consists of a set of compiler directives with which programmers can express parallelism to OpenMP-aware compilers. The compiler replaces the directives with code that calls library functions or reads environment variables that affect the program's run-time behavior.

OpenMP official website: https://www.openmp.org/

OpenMP was originally designed to express DOALL parallelism in loop structures. It uses a fork-join execution model: the serial parts of the program are executed by a single thread (the master thread); when a parallel region is encountered, the master thread forks child threads that execute together until the end of the region, where the child threads join back into the master thread.

General usage

#pragma omp directive-name [clause[[,] clause] ... ] new-line
// When "for" is the directive-name, the available clauses include:
// private(variable-list)
// firstprivate(variable-list)
// lastprivate(variable-list)
// reduction(operator: variable-list)
// ordered
// schedule(kind[, chunk_size])
// nowait
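As a small illustration of these clauses (our sketch, not from the original text): the loop below sums an array in parallel; reduction(+: sum) gives each thread a private partial sum and combines them at the end, and schedule(static) splits the iterations evenly across threads. Compile with, e.g., gcc -fopenmp.

#include <stdio.h>

int main(void) {
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++)
        a[i] = 0.5 * i;

    // Each thread gets a private copy of sum; the copies are
    // combined with + when the parallel loop finishes.
    #pragma omp parallel for reduction(+: sum) schedule(static)
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %f\n", sum);  // same result as the serial loop
    return 0;
}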

A parallel region is expressed with #pragma omp parallel:

#pragma omp parallel
{
    // begin of parallel region
    // code here is executed by every thread in the team
} // end of parallel region
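A runnable sketch of the fork-join behavior described above (thread count and output order will vary):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial part: master thread only\n");

    #pragma omp parallel            // fork: a team of threads starts here
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                               // join: the team merges back into the master

    printf("serial part again\n");
    return 0;
}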

Parallel programming for linked data structures (LDS)

Transactional memory (TM) can simplify parallel LDS programming to some extent by encapsulating each LDS operation in a transaction, for example:

atomic {
    Insert(...);  // insert an element
}

atomic {
    Delete(...);  // delete an element
}
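As one concrete (and hedged) way to write this in real code: GCC ships an experimental transactional memory extension, enabled with -fgnu-tm, whose __transaction_atomic blocks play the role of the atomic pseudocode above. The list structure and insert routine here are invented for the sketch.

// Sketch only: assumes GCC with -fgnu-tm; node_t/list_t are hypothetical.
typedef struct node { int key; struct node *next; } node_t;
typedef struct { node_t *head; } list_t;

void list_insert(list_t *list, node_t *n) {
    __transaction_atomic {   // the whole update commits (or aborts) as one unit
        n->next = list->head;
        list->head = n;
    }
}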

Storage hierarchy

The reason for the storage hierarchy: to bridge the gap between processor speed and main memory speed.

Cache coherence and synchronization primitives

To ensure that parallel programs run correctly and efficiently, shared memory multiprocessor systems must provide hardware support for cache coherence, memory consistency, and synchronization primitives.

Cache coherence basics

(Figure: the cache coherence problem in a bus-based multiprocessor.)

Drawback of write-through cache coherence protocols: writes to cache blocks are localized in time and space, yet under write-through every write triggers a bus write and thus occupies bus bandwidth, so bandwidth is used up quickly. Under a write-back cache, by contrast, even if one or more words or bytes in the same cache block are written many times, the bus needs to be occupied only once to invalidate the copies in other caches.


The MSI protocol for write-back caches: compared with write-through caches, a write-back cache significantly reduces bandwidth overhead (a write-back cache adds a "dirty" status that marks whether any location in the cache block has been modified since it was loaded).

Each cache block has associated status:

  • Modified (M): the cache block is valid and its data is (possibly) different from the original copy in main memory.
  • Shared (S): the cache block is valid and may be shared by other processors. It is also clean, i.e., the cached value is the same as the value in main memory. This state is similar to the V (valid) state in the write-through cache coherence protocol.
  • Invalid (I): the cache block is invalid.
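To make the transitions concrete, here is a minimal sketch of ours (the event names are invented for the example) of how one cache block moves between the M, S, and I states in response to local processor accesses and bus events snooped from other caches:

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
// PR_RD/PR_WR: reads/writes by the local processor;
// BUS_RD/BUS_RDX: read / read-exclusive requests snooped on the bus.
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event_t;

msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (s) {
    case INVALID:
        if (e == PR_RD) return SHARED;     // fetch block with a bus read
        if (e == PR_WR) return MODIFIED;   // read-exclusive invalidates other copies
        return INVALID;
    case SHARED:
        if (e == PR_WR)   return MODIFIED; // upgrade: invalidate other sharers
        if (e == BUS_RDX) return INVALID;  // another cache wants exclusive access
        return SHARED;
    case MODIFIED:
        if (e == BUS_RD)  return SHARED;   // flush dirty data, keep a clean copy
        if (e == BUS_RDX) return INVALID;  // flush dirty data, drop the copy
        return MODIFIED;
    }
    return s;
}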

Drawback: even when a block is cached by only one processor, the MSI protocol still triggers two bus transactions for a read-then-write sequence. This flaw hurts the performance of programs that share little data, such as sequential programs.


The MESI protocol for write-back caches: to solve the MSI problem above, the MESI protocol adds a state that distinguishes whether a cache block is clean and unique, or clean but present as copies in multiple caches.

Each cache block has associated status:

  • Modified(M)
  • Exclusive(E): the cache block is clean, valid, and unique
  • Shared(S)
  • Invalid(I)

Main memory bandwidth demand can be reduced through dirty sharing (supplying a dirty block directly from one cache to another).


The MOESI protocol for write-back caches: this protocol allows dirty sharing. MESI is generally used on Intel Xeon processors, while MOESI is generally used on AMD processors.

Each cache block has associated status:

  • Modified(M)
  • Exclusive(E)
  • Owned(O): the cache block is valid, may be dirty, and may have multiple copies. However, when there are multiple copies, only one of them can be in the O state; all the other copies are in the S state.
  • Shared(S)
  • Invalid(I)

Update-based protocols for write-back caches

Hardware support for synchronization

Lock
Lock implementation types:

  • test-and-set (TS) lock
  • test-and-test-and-set lock (TTSL)
  • load-linked/store-conditional (LL/SC) lock
  • ticket lock
  • array-based queuing lock (ABQL)
| Criterion | test&set | TTSL | LL/SC | Ticket | ABQL |
| --- | --- | --- | --- | --- | --- |
| Uncontended latency | lowest | lower | lower | higher | higher |
| Max traffic per lock release | O(p) | O(p) | O(p) | O(p) | O(1) |
| Traffic while waiting | high | low | low | low | low |
| Storage | O(1) | O(1) | O(1) | O(1) | O(p) |
| Fairness guaranteed? | no | no | no | yes | yes |
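As an example of the table's middle column, a test-and-test-and-set lock can be sketched with C11 atomics (our sketch, not from the original text): the inner loop spins on a plain read that hits in the local cache, and the atomic exchange is attempted only when the lock looks free, which is exactly why its waiting traffic is lower than plain test&set.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttsl_t;

void ttsl_acquire(ttsl_t *l) {
    for (;;) {
        // "test": spin on an ordinary load served from the local cache
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        // "test-and-set": one atomic exchange when the lock appears free
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;  // exchange returned false: we took the lock
    }
}

void ttsl_release(ttsl_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}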

Barrier

Barrier implementation types (a sketch of the first variant follows the list):

  • sense-reversing centralized barrier
  • combining tree barrier
  • hardware barrier implementations
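A minimal sketch of the sense-reversing centralized barrier (our illustration; local_sense is per-thread state initialized to 0, and count starts at total):

#include <stdatomic.h>

typedef struct {
    atomic_int count;   // threads that have not yet arrived
    atomic_int sense;   // global sense, flipped once per barrier episode
    int total;          // number of participating threads
} barrier_t;

void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;               // flip this thread's sense
    if (atomic_fetch_sub(&b->count, 1) == 1) {  // last thread to arrive
        atomic_store(&b->count, b->total);      // reset the counter for reuse
        atomic_store(&b->sense, *local_sense);  // release all waiting threads
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                   // spin until the sense flips
    }
}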

Transactional memory

Memory consistency models and cache coherence solutions

Memory consistency model

The memory consistency model is separate from the cache coherence protocol: the coherence protocol only solves the problem of ordering accesses to a single memory block address, while the ordering of accesses to different addresses is not a problem the coherence protocol addresses.
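The classic two-address example (our sketch in C11; the original text gives no code): coherence orders accesses to data and to flag individually, but only the consistency model, made explicit here with acquire/release atomics, guarantees that a consumer that sees flag == 1 also sees data == 42.

#include <stdatomic.h>

int data;                // ordinary shared variable
atomic_int flag;         // synchronization flag, initially 0

void producer(void) {
    data = 42;
    // Release store: the write to data must become visible
    // before any thread can observe flag == 1.
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void) {
    // Acquire load: once it reads 1, the earlier write to data is visible.
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    return data;         // guaranteed to be 42
}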

Advanced cache coherence design

Snooping coherence protocol

Snooping is the earliest mainstream implementation of cache coherence. It relies on two facts:

  • First, as a broadcast medium, the bus makes requests globally visible, i.e., all caches see the same request appear on the bus at the same time.
  • Second, all first-level and second-level caches closely monitor (snoop) the requests appearing on the bus and independently and correctly update the state of the corresponding cache line.

As the number of processors increases, the available interconnection network bandwidth is quickly saturated by broadcast traffic.

Directory coherence protocol

The directory protocol is another common implementation of cache coherence. It relies on the second-level cache to record how cache lines are shared among the first-level caches.

In a directory-based coherence protocol, any cache coherence request must first visit the directory in the second-level cache. The advantage of the directory is that point-to-point data transfers replace the global broadcast of snooping; this property is particularly important when the system has a large number of computing cores.

Interconnection network architecture

Distributed operating system

SIMT architecture

The Single-Instruction Multiple-Thread (SIMT) architecture is generally used in graphics processors (GPUs) and is similar to SIMD.

Differences between SIMD and SIMT:

  • From the thread perspective: in SIMD, one thread generally processes one vectorized instruction, which typically takes one cycle; SIMT generally creates multiple threads, which takes multiple cycles.
  • From the logic unit perspective: SIMT requires about 4 times as many logic units as SIMD.

From the hardware architecture perspective, SIMT architectures generally convert scalar instructions into SIMD-style vectorized processing to obtain higher performance.


Reference books:

  • "Computer Architecture: Quantitative Research Methods"
  • "In-depth Linux Kernel Architecture"
  • "Distributed Operating System"
  • 《Multicore Processors and Systems》


Origin blog.csdn.net/qq_48322523/article/details/128010918