Parallel and Distributed Computing: OpenMP Detailed Explanation (5)
- Section 5 OpenMP
- Compiler directives
- Directive: #pragma omp parallel for
- Directive: #pragma omp parallel
- Directive: #pragma omp for
- Directive: #pragma omp single
- Directive: #pragma omp section(s)
- Directive: #pragma omp task
- Directive: #pragma omp barrier and #pragma omp taskwait
- Clause: reduction
- Clause: schedule
- Library Functions
- Advantages and disadvantages of OpenMP
- Writing and debugging OpenMP in Visual Studio 2019
Section 5 OpenMP
OpenMP has three API components
- Compiler directives
- Runtime library routines (library functions)
- Environment variables
Compiler directives
Syntax: #pragma omp directive-name [clause ...]
A directive applies to the program statement that immediately follows it (the directive appears on the line before that statement)
Directive: #pragma omp parallel for
#pragma omp parallel for
This directive tells the compiler that the for loop immediately following it can be executed in parallel, subject to these restrictions:
- The number of loop iterations must be computable before the loop executes
- The loop body cannot contain break, return, or exit
- The loop body cannot contain a goto that jumps outside the loop
For example:
int a[1000], b[1000], s[1000];
int i;
...
#pragma omp parallel for
for (i = 0; i < 1000; i++)
    s[i] = a[i] + b[i];
Directive: #pragma omp parallel
- Indicates that the structured block that follows should be executed in parallel by a team of threads
- This generally fits the single-program, multiple-data (SPMD) pattern
- In other words, consider this directive when you apply the same computation to multiple data sets
Directive: #pragma omp for
- Can only be used inside a region already marked with #pragma omp parallel
- Indicates that the iterations of the for loop that follows should be divided among the threads previously created by parallel
- The work of the for loop is executed exactly once in total, while code outside the for loop is executed once per thread
- Combining the iterations executed by all the threads yields exactly one complete pass over the loop, with no repetition and no omission
- There is an implicit barrier at the end of the for loop that synchronizes all threads, so every thread finishes the loop before any of them proceeds to the next step
In the following example, the part marked by #pragma omp for is divided among multiple threads, and each iteration is executed only once.
Directive: #pragma omp single
Conditions of use: can only appear inside a parallel region
Effect
- Tells the compiler that only one thread should execute the structured block immediately below
- The other threads in the team wait for that thread to finish the block
- Unless the nowait clause is given, which tells the other threads not to wait
- The master directive has an effect similar to single nowait, except that the block is always executed by the master thread
Practical use
- Handling I/O and other work that could go wrong if several threads performed it at the same time
Directive: #pragma omp section(s)
section: marks one independent block of code inside a sections region
sections: encloses one or more section blocks and hands them out to the team of threads
- The relationship between sections and threads is like meat and monks: each thread is a monk trying to grab a piece of meat, and after finishing one piece it grabs another
- As soon as the meat is served, all the thread-monks immediately try to grab a piece and eat it
- Each piece of meat can be taken by only one monk (each section is executed by exactly one thread)
- If some thread-monks eat fast enough and there is plenty of meat, some threads may end up executing more sections than others
Directive: #pragma omp task
Creates a task, executed by some thread in the team, to perform the work that follows
See the parallel computation of the Fibonacci sequence below
Directive: #pragma omp barrier and #pragma omp taskwait
Task completion and synchronization
The following conditions guarantee that tasks have completed
- When a thread or task ends, all of its child threads and child tasks must have completed
- At a barrier directive
- All threads in the team synchronize here
- At a taskwait directive
- The current task waits here for all of its child tasks to complete
Clause: reduction
The reduction clause is widely supported: parallel, for, and sections all accept it
- #pragma omp … reduction(operator : variableList)
- variableList is the list of variables to be reduced, separated by ","
The reduction clause performs a reduction on the listed variables. Each thread receives a private copy of every listed variable, initialized to a value appropriate to the operator, and works on that copy during the parallel computation. When the loop ends, the private copies are combined: for example, +:sum creates a private sum in each thread for the additions, and at the end all the private sums and the initial value of sum are merged to complete the summation.
The initial values for common operators are:
- + : 0
- * : 1
- - : 0
- & : ~0 (all bits set)
- | : 0
- ^ : 0
- && : 1
- || : 0
- max : smallest representable value; min : largest representable value
Code example:
#include <stdio.h>
#include <omp.h>
int main()
{
    omp_set_num_threads(2);
    int sum = 3;
    int prod = 5;
    #pragma omp parallel for reduction(+:sum) reduction(*:prod) num_threads(2)
    for (int i = 1; i <= 3; ++i)
    {
        int tid = omp_get_thread_num();
        sum += i;
        prod *= i;
        printf("thread(%d) sum = %d prod = %d\n", tid, sum, prod);
    }
    printf("results: sum = %d prod = %d\n", sum, prod);
}
The output (one possible interleaving) is
thread(0) sum = 1 prod = 1
thread(1) sum = 3 prod = 3
thread(0) sum = 3 prod = 2
results: sum = 9 prod = 30
// In threads 0 and 1, the private sum starts at 0 and the private prod starts at 1 (the reduction initial values)
// Thread 0 handles i=1 and i=2; thread 1 handles i=3
Clause: schedule
Simply splitting the iterations of a for loop evenly is not always fine-grained enough.
For example, consider a for loop with 12 iterations of varying cost (in the original figure, each row corresponded to the work of one value of the iterator i, and different colors indicated which processor it was assigned to).
Since the amount of work may vary with the iteration index, how can the 12 tasks of a for loop be distributed across threads according to their workload so that the total execution time is shorter?
Consider static scheduling
Before execution, the thread to which each iteration will be assigned is already determined.
Question to consider for the actual workload: how should the static schedule be chosen to keep the load balanced?
Consider dynamic scheduling
Iterations are assigned to threads dynamically while the loop executes.
Question to consider for the actual workload: how to balance the scheduling overhead against the load balancing that dynamic assignment provides?
schedule clause
Usage: schedule(kind, chunkSize)
- static scheduling
  - Iterations 0 .. chunkSize-1 go to the first thread, iterations chunkSize .. 2*chunkSize-1 to the second thread, and so on (chunks are handed out round-robin)
  - Low overhead, but may cause load imbalance
  - Static scheduling suits cases where the work per iteration is known before execution, so a sensible up-front assignment will not cause much load imbalance
  - If chunkSize is omitted, the iterations are divided roughly equally (chunkSize = number of iterations / number of threads, rounded up)
- dynamic scheduling
  - Each time, chunkSize iterations are handed to an available thread
  - That is, whenever a thread becomes idle, it automatically picks up the next chunk of chunkSize iterations
  - Since thread start and execution times are unpredictable, it is impossible to know in advance which thread will execute which iterations
  - If chunkSize is omitted, the default is 1 (iterations are handed out one at a time)
- guided scheduling
  - chunkSize gives the minimum number of iterations handed out per assignment
  - The number of iterations handed out varies: chunks are relatively large at first and gradually decrease
  - It is a more flexible form of dynamic scheduling: while many iterations remain, large chunks are handed out, keeping scheduling overhead low; when few remain, small chunks are handed out, keeping the load balanced
Library Functions
int omp_get_num_procs(void);      // returns the number of processors currently available
void omp_set_num_threads(int t);  // sets the number of threads to use in subsequent parallel regions
Advantages and disadvantages of OpenMP
Advantages
- Suitable for domain decomposition (data parallelism)
- Runs on both *nix and Windows
Disadvantages
- Not well suited to functional decomposition
Writing and debugging OpenMP in Visual Studio 2019
To use OpenMP in VS2019, OpenMP support must be enabled: right-click the project (or source file), then Properties, C/C++, Language, Open MP Support (/openmp).