Introduction to Parallel and Distributed Computing (5) OpenMP Basics

Section 5 OpenMP

OpenMP has three API components

  • Compiler directives
  • Runtime library routines (library functions)
  • Environment variables

Compiler directives

Syntax: #pragma omp directive-name [clause ...]

A directive is written immediately before the program statement it applies to (it appears on the preceding line).

Directive: #pragma omp parallel for

This directive tells the compiler that the next for loop can be executed in parallel.

  • The number of loop iterations must be computable before the loop executes
  • The loop body cannot contain break, return, or exit
  • The loop body cannot contain a goto that jumps outside the loop

For example:

int a[1000], b[1000], s[1000];
...
#pragma omp parallel for
for (int i = 0; i < 1000; i++)   // the loop variable is private to each thread
    s[i] = a[i] + b[i];

Directive: #pragma omp parallel

  • Indicates that the following structured block should be executed in parallel by a team of threads
  • This generally applies to the single-program, multiple-data (SPMD) case
    • In other words, if you apply the same procedure to multiple data sets, consider this directive (see the sketch below)
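As a minimal sketch of the SPMD idea (my own illustration, not from the original notes): every thread in the team runs the same block and distinguishes itself via omp_get_thread_num().

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        // this thread's ID
        int nthreads = omp_get_num_threads();  // size of the thread team
        printf("thread %d of %d runs the same block\n", tid, nthreads);
    }
    return 0;
}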

Directive: #pragma omp for

  • Can only be used inside a region already marked by #pragma omp parallel
  • Indicates that the iterations of the following for loop should be divided among the threads previously started by parallel
    • Each loop iteration is executed exactly once in total, while code outside the loop is executed once per thread
    • Combining the iterations executed by all the threads yields exactly one complete pass over the loop, with no repetition and no omission
    • There is an implicit barrier at the end of the for loop that synchronizes all threads, so everyone leaves the loop at the same time and moves on to the next step

In the following example, the part marked by #pragma omp for is divided among multiple threads and executed only once in total.

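Since the original figure is unavailable, here is a minimal sketch of the same idea (my own example): the printf outside the loop runs once per thread, while the loop iterations are divided among the threads and each runs exactly once.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int s[8];
    #pragma omp parallel
    {
        // executed once per thread
        printf("thread %d entered the parallel region\n", omp_get_thread_num());
        #pragma omp for
        for (int i = 0; i < 8; i++)
            s[i] = i * i;   // each iteration executed by exactly one thread
        // implicit barrier here: all threads leave the loop together
    }
    printf("s[7] = %d\n", s[7]);
    return 0;
}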

Directive: #pragma omp single

Conditions of use: only inside a parallel region

Effect

  • Tells the compiler to use only one thread to execute the code block immediately below
  • The other threads in the same team wait for that thread to finish the block
    • Unless the nowait clause is used, which tells the other threads not to wait
      • The master directive has a similar effect to single nowait, except that the block is executed specifically by the master thread

In practice

  • Useful for handling I/O and other work that can go wrong when performed by several threads at once (see the sketch below)
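A minimal single sketch (my own illustration): one thread performs the I/O while the rest of the team waits at the implicit barrier at the end of the block.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            // executed by exactly one thread of the team
            printf("thread %d handles the I/O\n", omp_get_thread_num());
        }   // implicit barrier: the other threads wait here (unless nowait)
        // all threads continue together from this point
    }
    return 0;
}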

Directive: #pragma omp section(s)

section: marks one independent block of code inside a sections construct

sections: encloses one or more section blocks to be divided among a team of threads

  • The relationship between sections and threads is like meat and monks: each thread-monk tries to grab a piece of meat, and after finishing one piece grabs another (see the sketch below)
    • Once the meat is served, all the thread-monks immediately try to grab a piece and eat it
    • Each piece of meat can be taken by only one monk
    • If some thread-monks eat fast enough, they may end up eating several pieces while slower ones get few or none
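A small sections sketch (my own example): three pieces of "meat", each eaten by exactly one thread-monk; a fast thread may take more than one.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("piece 1 eaten by thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("piece 2 eaten by thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("piece 3 eaten by thread %d\n", omp_get_thread_num());
    }
    return 0;
}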

Directive: #pragma omp task

Creates a task for the block that follows; some thread in the team executes it as a subtask.

See the task-based parallel computation of the Fibonacci sequence below.

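The original figure is missing, so here is a sketch of the standard task-based Fibonacci computation it refers to: the two recursive calls become child tasks, and taskwait joins them.

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n - 1);        // child task
    #pragma omp task shared(y)
    y = fib(n - 2);        // child task
    #pragma omp taskwait   // wait for both child tasks to finish
    return x + y;
}

int main(void)
{
    int result;
    #pragma omp parallel
    {
        #pragma omp single   // only one thread spawns the root call
        result = fib(10);
    }
    printf("fib(10) = %d\n", result);  // prints 55
    return 0;
}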

Directive: #pragma omp barrier and #pragma omp taskwait

Task completion and synchronization

The following points guarantee that tasks have completed:

  • When a thread or task ends, all of its child threads and child tasks must already have completed
  • At a barrier directive (see the sketch below)
    • All threads in the thread team synchronize there
  • At a taskwait directive
    • The current task waits there for all of its child tasks to complete
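A minimal barrier sketch (my own illustration): no thread starts phase 2 until every thread has finished phase 1.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("phase 1, thread %d\n", omp_get_thread_num());
        #pragma omp barrier  // all threads in the team synchronize here
        printf("phase 2, thread %d\n", omp_get_thread_num());
    }
    return 0;
}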

Clause: reduction

The reduction clause is common; parallel, for, and sections all support it.

  • #pragma omp … reduction(operator : listVariable)
    • The list variables are the variables to be reduced, separated by ","

The reduction clause performs a reduction over the listed variables. Each thread gets a private copy of every variable in the list, initialized to the appropriate initial value for the operator, and carries out its share of the computation on that copy. After the for loop ends, the private copies are reduced into the original variable (for example, +:sum creates a private sum in each thread for the additions, and at the end all the private sums and the initial value of sum are merged to complete the summation).

The initial values corresponding to common operators are as follows:

  • +, -, |, ^, ||: 0
  • *, &&: 1
  • &: ~0 (all bits set)
  • max: the smallest representable value; min: the largest representable value

Code example:

#include <stdio.h>
#include <omp.h>
int main()
{
	omp_set_num_threads(2);
	int sum = 3;
	int prod = 5;
	#pragma omp parallel for reduction(+:sum) reduction(*:prod) num_threads(2)
	for (int i = 1; i <= 3; ++i)
	{
		int tid = omp_get_thread_num();
		sum += i;   // operates on this thread's private copy of sum (initial value 0)
		prod *= i;  // operates on this thread's private copy of prod (initial value 1)
		printf("thread(%d) sum = % d prod = % d\n", tid, sum, prod);
	}
	printf("results: sum = % d prod = % d\n", sum, prod);
	return 0;
}

One possible output (the interleaving of threads may vary):

thread(0) sum =  1 prod =  1
thread(1) sum =  3 prod =  3
thread(0) sum =  3 prod =  2
results: sum =  9 prod =  30
// In threads 0 and 1 the private sum starts at 0 and the private prod starts at 1
// Thread 0 handles the i=1 and i=2 iterations; thread 1 handles i=3

Clause: schedule

Simply splitting the iterations of a for loop evenly is not always fine-grained enough.

Consider an example (different colors indicate tasks assigned to different processors; each row corresponds to the task of one iteration i).

Since the amount of work may vary from iteration to iteration, how can the 12 tasks of a for loop be distributed across the threads according to their workload, so that the total execution time is shorter?

Consider static scheduling

Before execution, the thread to which each iteration will be assigned is already determined.

Question to consider in practice: how should the static assignment be chosen so that the load is better balanced?

Consider dynamic scheduling

Iterations are assigned to threads dynamically while the loop executes.

Question to consider in practice: how should the allocation overhead be balanced against load balance in dynamic assignment?

The schedule clause

Usage: schedule(kind, chunkSize)

  • Static scheduling (static)
    • Iterations 0 to chunkSize-1 are assigned to the first thread, chunkSize to 2*chunkSize-1 to the second thread, and so on
    • Low overhead, but may cause load imbalance
    • Static assignment suits cases where the workload is known before execution, so that a proper assignment at the start does not cause too much load imbalance
    • If chunkSize is omitted, the default is an even split (chunkSize = number of iterations / number of threads, rounded up)
  • Dynamic scheduling (dynamic)
    • Each time, chunkSize iterations are handed to an available thread
      • That is, whenever a thread becomes idle, it automatically picks up a chunk of chunkSize iterations
      • Since thread start and execution times are unpredictable, it is impossible to know in advance which thread an iteration will be assigned to
    • If chunkSize is omitted, the default is 1 (iterations are handed out one by one)
  • Guided scheduling (guided)
    • chunkSize is the minimum number of iterations handed out per assignment
    • The number of iterations handed out varies: relatively large at first, then gradually decreasing
    • It is a more flexible form of dynamic assignment: while many iterations remain, more are handed out at once, so allocation overhead is lower and execution more efficient; when few remain, fewer are handed out at once, giving a better-balanced load (see the sketch below)
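A small sketch to experiment with (my own example, with an arbitrary chunk size of 2): replace dynamic with static or guided and observe which thread picks up which iteration.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // 12 iterations, handed out in chunks of 2 to whichever thread is idle
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < 12; i++)
        printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
    return 0;
}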

Library Functions

int omp_get_num_procs(void);     // returns the number of physical processors currently available
void omp_set_num_threads(int t); // sets the number of threads the program activates
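A hedged usage sketch combining the two routines: query the processor count, then run one thread per processor.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int procs = omp_get_num_procs();  // available processors
    omp_set_num_threads(procs);       // one thread per processor
    #pragma omp parallel
    {
        #pragma omp single
        printf("%d processors, %d threads\n", procs, omp_get_num_threads());
    }
    return 0;
}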

Advantages and disadvantages of OpenMP

Advantages

  • Well suited to domain decomposition (data parallelism)
  • Runs on *nix and Windows

Disadvantages

  • Not well suited to functional decomposition

Writing and debugging OpenMP in Visual Studio 2019

To use OpenMP in VS2019, OpenMP support must be turned on: right-click the project, then Properties, C/C++, Language, OpenMP Support.


Origin blog.csdn.net/Kaiser_syndrom/article/details/105244881