Self-study OpenMP Guide: Multi-layer for Loops

In many scenarios we use a multi-layer (nested) for loop to solve a problem. How to use OpenMP to accelerate such loops is the topic of this post.

Let's start with an example of a multi-layer loop

Look at this piece of code first:

for (int i = 0; i < 2; i++) {
	cout << "first loop" << endl;
	for (int j = 0; j < 2; j++) {
		cout << "second loop" << endl;
		for (int k = 0; k < 2; k++) {
			printf("third loop i = %d j = %d k = %d \n", i, j, k);
		}
	}
}

The above is 0-test.cc; it will be modified later to produce 1-test.cc, 2-test.cc, and 3-test.cc.

There are three levels of for loop here, and each level prints something. Our goal is to accelerate these three nested loops.

First of all, we should be clear that the main reason OpenMP can increase running speed is parallelism. Without OpenMP the program is serial: one operation after another. But many operations do not actually affect each other, so there is room to improve performance: we can run several mutually independent operations at the same time, thereby improving efficiency.

If different operations of a program do influence each other, does that mean it cannot be parallelized? Not necessarily. Operations that run in parallel must not affect each other, so to speed up code whose operations interact, you first restructure it so that the operations become independent, and then parallelize. That topic will be covered in other posts, so I will not go into it here.

Take the previous example: looking only at the innermost print, the program simply repeats it 8 times, and these 8 prints do not interfere with each other, so performance can be improved through parallelism.

The key question is how to use OpenMP to achieve this parallelism.

Some parallel attempts

To better explain the role of OpenMP here, I will walk through it with the following concrete examples.

1-test.cc
omp_set_num_threads(4);
#pragma omp parallel
for (int i = 0; i < 2; i++) {
	cout << "first loop" << endl;
	for (int j = 0; j < 2; j++) {
		cout << "second loop" << endl;
		for (int k = 0; k < 2; k++) {
			printf("third loop i = %d j = %d k = %d \n", i, j, k);
		}
	}
}

Unfortunately, this code does not improve performance. #pragma omp parallel does create the specified number of threads, but in 1-test.cc each of those threads executes the entire loop nest, so the total time is not reduced: the work is duplicated, not divided.

2-test.cc
omp_set_num_threads(4);
#pragma omp parallel
#pragma omp for
for (int i = 0; i < 2; i++) {
	cout << "first loop" << endl;
	for (int j = 0; j < 2; j++) {
		cout << "second loop" << endl;
		for (int k = 0; k < 2; k++) {
			printf("third loop i = %d j = %d k = %d \n", i, j, k);
		}
	}
}

This code really does improve efficiency: the running time drops to about half.

What is the reason for this? From IBM's official explanation of OpenMP, we know that #pragma omp parallel performs the following operations:

  • It creates a team with the specified number of threads
  • Each thread executes all operations within the scope of the directive
  • Operations inside a work-sharing construct, however, are divided among the threads of the team

#pragma omp for marks the loop as a work-sharing construct, so its iterations are completed by different threads respectively. Therefore #pragma omp for must be added before the for loop to achieve real parallelism.
3-test.cc

#pragma omp parallel for combines the two directives of 2-test.cc into a single line:

omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < 2; i++) {
	cout << "first loop" << endl;
	for (int j = 0; j < 2; j++) {
		cout << "second loop" << endl;
		for (int k = 0; k < 2; k++) {
			printf("third loop i = %d j = %d k = %d \n", i, j, k);
		}
	}
}

Conclusion

For a multi-layer for loop whose innermost iterations do not affect each other, parallelism can be achieved by adding #pragma omp parallel for before the outermost for loop.

Note, however, that this method only works when the innermost iterations of the multi-layer loop do not affect each other. For loops whose iterations do interact, the code structure should be changed first so that the inner iterations become independent, and only then should the code be parallelized.

Moreover, not all multi-layer for loops can be parallelized with a single directive. For more complex code structures, first clarify the code logic, then rewrite the original code into a parallelizable structure, and then parallelize it.


Origin blog.csdn.net/kullollo/article/details/105732923