Classification according to Flynn's taxonomy
Flynn's taxonomy classifies parallel computers based on the number of instruction streams and data streams.
 | one data stream | many data streams |
---|---|---|
one instruction stream | SISD | SIMD |
many instruction streams | MISD | MIMD |
SISD
: Single instruction stream, single data stream. The classic uniprocessor architecture.
SIMD
: Single instruction stream, multiple data streams. Common in GPUs and in CPU vector extensions such as Intel MMX/SSE, AMD 3DNow!, etc.
MISD
: Multiple instruction streams, single data stream. Hard to apply in practice; the iWarp systolic array is a commonly cited example.
MIMD
: Multiple instruction streams, multiple data streams.
MIMD computer classification
According to the number of processors and how memory is organized, shared-memory multiprocessors are divided into two categories: SMP/UMA and DSM/NUMA.
SMP/UMA
: Shared-memory multiprocessor, also called centralized shared-memory multiprocessor or uniform memory access multiprocessor (generally no more than 8 cores). Cores share the cache and main memory.
DSM/NUMA
: Multiprocessors using physically distributed memory, called distributed shared memory; because access latency depends on which node holds the data, this is also called non-uniform memory access.
Parallel programming
Fundamentally, the point of executing an algorithm or code in parallel is to obtain a shorter execution time than its serial counterpart. A useful tool for analyzing parallel program execution time is Amdahl's law.
Parallel programming models
The two main models are shared memory and message passing.
Shared memory model: different threads or parallel tasks can access any location in memory, and they communicate implicitly by writing and reading shared memory locations (similar to multiple threads of the same process sharing an address space).
Message passing model: each thread has its own local memory, and one thread cannot access the memory of another. To exchange data, threads communicate by explicitly passing messages containing data values (similar to multiple processes that do not share an address space).
Model comparison
 | Shared memory model | Message passing model |
---|---|---|
Communication | implicit | explicit messages |
Synchronization | explicit | implicit (via messages) |
Hardware support | usually required | not required |
Programming effort | lower | higher |
Tuning effort | higher | lower |
Communication granularity | finer | coarser |
Shared memory generally requires dedicated hardware support. On a multi-core processor, the last-level cache may already be shared between cores. In the multi-node case, however, each node has its own processor and memory, and the nodes are interconnected to form a shared memory system. Hardware support is then needed to create the illusion of a single address space: the memory of all nodes appears as one memory that every processor can address.
When the number of processors is large, it becomes difficult to implement this shared memory abstraction at low cost.
Other programming models:
- Partitioned global address space (PGAS): allows all threads to transparently share a single address space
- Data-parallel programming model: similar to SIMD
- MapReduce: for clusters
- Transactional memory (TM): defines a piece of code as a transaction
- …
Shared memory parallel model
Take OpenMP as an example: an application programming interface that supports shared memory programming. It consists of a set of compiler directives that programmers use to express parallelism to OpenMP-aware compilers. The compiler replaces the directives with code that calls library functions or reads environment variables affecting the program's runtime behavior.
OpenMP
: official website link
OpenMP was originally designed to express DOALL loop parallelism. It uses a fork-join execution model: the serial parts of the program are executed by a single thread (the master thread); when a parallel region is encountered, the master thread forks child threads that execute together until the end of the region, where the child threads join back into the master thread.
General usage
#pragma omp directive-name [clause[[,] clause] ... ] new-line
// when "for" is the directive-name, available clauses include:
// private(variable-list)
// firstprivate(variable-list)
// lastprivate(variable-list)
// reduction(operator: variable-list)
// ordered
// schedule(kind[, chunk_size])
// nowait
A parallel region is written with #pragma omp parallel:
#pragma omp parallel
{
// begin
// parallel content
} // end
Parallel programming for linked data structures (LDS)
Transactional memory (TM) can simplify parallel programming of an LDS to some extent by encapsulating each LDS operation in a transaction, for example:
atomic {
    Insert(...)  // insert an element
}
atomic {
    Delete(...)  // delete an element
}
Storage hierarchy
The reason for the storage hierarchy: bridging the gap between processor speed and main memory speed.
Cache coherence and synchronization primitives
To ensure that parallel programs run correctly and efficiently, shared memory multiprocessor systems must provide hardware support for cache coherence, memory consistency, and synchronization primitives.
Cache coherence basics
[Figure: the cache coherence problem in a bus-based multiprocessor]
Drawback of write-through cache coherence protocols: writes to cache blocks are localized in time and space, yet under write-through every write triggers a bus transaction, so bus bandwidth is consumed quickly. Under a write-back cache, if one or more words or bytes in the same cache block are written multiple times, the bus only needs to be used once to invalidate the copies in other caches.
MSI protocol for write-back caches: compared with write-through, a write-back cache significantly reduces bandwidth overhead (a write-back cache block has a "dirty" status bit marking whether any location in the block has changed since it was loaded).
Each cache block has an associated state:
Modified(M)
: The cache block is valid and its data (possibly) differs from the data in main memory.
Shared(S)
: The cache block is valid and may be shared by other processors. It is also clean, i.e. the cached value matches the value in main memory. This state is similar to the V state in the write-through coherence protocol.
Invalid(I)
: The cache block is invalid.
Drawback: even when a block resides in only one cache, the MSI protocol triggers two bus transactions for a read-then-write sequence. This flaw hurts the performance of programs that share little data, such as sequential programs.
MESI protocol for write-back caches: to address this MSI drawback, the MESI protocol adds a state to distinguish whether a cache block is clean and unique, or clean but present in multiple caches.
Each cache block has an associated state:
Modified(M)
Exclusive(E)
: The cache block is clean, valid, and unique.
Shared(S)
Invalid(I)
Main memory bandwidth can be reduced through dirty sharing.
MOESI protocol for write-back caches: this protocol allows dirty sharing. MESI is generally used on Intel Xeon processors, while AMD processors use MOESI.
Each cache block has an associated state:
Modified(M)
Exclusive(E)
Owned(O)
: The cache block is valid, may be dirty, and may have multiple copies. When there are multiple copies, only one can be in the O state; all other copies are in the S state.
Shared(S)
Invalid(I)
Update-based protocols for write-back caches
Hardware support for synchronization
Lock
Lock implementation types:
- TS (test&set) lock
- TTSL (test-and-test&set) lock
- LL/SC (load-linked/store-conditional) lock
- Ticket lock
- ABQL (array-based queueing lock)
criterion | test&set | TTSL | LL/SC | Ticket | ABQL |
---|---|---|---|---|---|
Uncontended latency | lowest | lower | lower | higher | higher |
Maximum traffic on a single lock release | O(p) | O(p) | O(p) | O(p) | O(1) |
Traffic while waiting | high | – | – | – | – |
Storage | O(1) | O(1) | O(1) | O(1) | O(p) |
Fairness guaranteed? | no | no | no | yes | yes |
barrier
Barrier implementation types:
- Sense-reversal centralized barrier
- Combination tree barrier
- Hardware barrier implementation
transactional memory
Memory consistency model and cache coherence solutions
Memory consistency model
This is separate from the cache coherence protocol: cache coherence only solves the problem of ordering accesses to a single memory block address; ordering accesses to different addresses is not a problem the cache coherence protocol addresses.
Advanced cache coherence design
Snooping coherence protocol
As the earliest mainstream implementation of cache coherence, it relies on two facts:
- First, as a broadcast medium, the bus can make requests globally visible, that is, all caches can see which request appears on the bus at the same time.
- Second, all first-level and second-level caches closely monitor (snoop) the requests appearing on the bus, and independently and correctly change the state of the corresponding cache line.
As the number of processors increases, the available interconnection network bandwidth will quickly be filled up by broadcast traffic.
Directory coherence protocol
This is another common implementation of cache coherence; it relies on the second-level cache to record how cache lines are shared among the first-level caches.
In a directory-based coherence protocol, any cache coherence request must first consult the directory in the second-level cache. The advantage of the directory is that point-to-point data transfers replace the global broadcast of snooping; this property is particularly important when the system has a large number of computing cores.
interconnection network architecture
Distributed operating system
SIMT architecture
The Single-Instruction Multiple-Thread (SIMT) architecture is generally used in graphics processors (GPUs) and is similar to SIMD.
Differences between SIMD and SIMT:
- From the thread perspective: with SIMD, one thread generally processes one vectorized instruction, typically in one cycle; SIMT generally creates multiple threads, which requires multiple cycles.
- From the logic unit perspective: SIMT requires roughly 4 times as many logic units as SIMD.
From the hardware architecture perspective, the SIMT architecture generally converts scalar instructions into vectorized SIMD-style processing to obtain higher performance.
See article:
- Introduction to SIMD
- "Computer Architecture: A Quantitative Approach": a brief introduction to thread-level parallelism (TLP)
- Parallel Programming - OpenMP
- [Operating system] SMP vs NUMA vs MPP architecture introduction
- A brief discussion on cache coherence protocols and non-uniform cache access in multi-core systems
- SIMD and SIMT from the perspective of modern GPU programming
Reference books:
- "Computer Architecture: A Quantitative Approach"
- "Professional Linux Kernel Architecture"
- "Distributed Operating Systems"
- "Multicore Processors and Systems"