Parallel Programming

Chapter One

Parallel program: there are p cores; each core works independently of the others and accumulates its own partial sum, so each core performs roughly n/p additions

  1. Global sum: after each core computes its partial result, it sends the result to a master core, and the master core adds up the results from every core.
  2. Tree-structured sum: after each core finishes, the partial sums are first added in pairs, then those sums are added in pairs again, and so on

The second method is clearly better: the additions are spread across the cores, so the master core performs on the order of log2(p) additions instead of p − 1.
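A minimal serial sketch of the tree-structured combination (assuming the p partial sums have already been computed into an array sums[0..p-1]; the pairwise additions here stand in for the communication between cores):

#include <stdio.h>

int main(void) {
    int p = 8;                                   /* number of cores */
    double sums[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* partial sums, one per core */

    /* In each round, "core" i adds in the value held by core i + stride. */
    for (int stride = 1; stride < p; stride *= 2)
        for (int i = 0; i + stride < p; i += 2 * stride)
            sums[i] += sums[i + stride];

    printf("global sum = %g\n", sums[0]);        /* 36 */
    return 0;
}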

Approaches to Parallel Programs: Task Parallelism and Data Parallelism

Task parallelism: distribute the tasks to be performed among the cores, each core executing a different task
Data parallelism: distribute the data to be processed among the cores, each core performing similar operations on its share of the data

Chapter two

Foster's methodology:

  1. Partitioning: divide the problem and identify the tasks
  2. Communication: identify the communication that must take place between tasks
  3. Agglomeration: combine tasks into larger composite tasks
  4. Mapping: assign the aggregated tasks to processes/threads

Exercises

2.1
In the memory-hierarchy diagram, going from top to bottom, capacity increases, access speed decreases, and cost per byte decreases.

2.2
After data in memory is loaded into the cache, it must at some point be written back to memory. There are five write strategies: write-through, write-back, write-once, WC (write-combining), and UC (uncacheable).

Write-through:

On a cache write hit, the processor writes the data to the cache and to memory at the same time, so the data in memory and in the cache stay synchronized. This method is simple and reliable, but every cache update is also written to memory, which keeps the bus busy and consumes a large share of the memory bandwidth, so performance suffers.

For example, if a program frequently modifies a local variable whose lifetime is very short and which no other process/thread uses, the CPU will still keep moving data between the cache and memory, wasting bandwidth unnecessarily.

On a cache write miss, the data can only be written directly to main memory, and the write-through scheme then has two choices about whether to fetch the modified memory block into the cache.

One is to fetch the block and allocate a cache location for it, called WTWA (Write-Through with Write-Allocate).
The other does not allocate, and is called WTNWA (Write-Through with No-Write-Allocate).

The former keeps the cache and main memory consistent but is more complicated; the latter simplifies the operation but lowers the hit rate, since a modified memory block can only be brought into the cache when a read miss causes a replacement. The write-through scheme guarantees that the cache and main memory are written synchronously. The figure shows the flow chart of the WTNWA write-through scheme.

(figure: WTNWA write-through flow chart)

  • With a write buffer, the CPU writes to the cache and updates main-memory data through the buffer at the same time, instead of waiting for the memory write to finish

Chapter 5 OpenMP for shared memory programming

OpenMP is an API for programming shared memory.

MP = multiprocessing

  • Every OpenMP thread in the system can potentially access all of the
    memory regions accessible to the process
  • Shared-memory system: the system is viewed as a collection of cores or CPUs
    that all have access to main memory


OpenMP and Pthreads: connection and differences


  • In Pthreads, when forking and joining multiple threads, you must allocate storage for a special structure for each thread, use one for loop to start every thread, and use another for loop to terminate them
  • In OpenMP there is no need to explicitly start and terminate multiple threads; it is higher level than Pthreads

OpenMP thread concept

Start a process, and the process starts the threads.

Threads share most of the resources of the process that started them, such as access to standard input and standard output, but each thread has its own
stack and program counter

Before the parallel region, the program uses only one thread, started when the process begins executing. When the program reaches the parallel directive, the original thread continues and another thread_count − 1 threads are started.

  • In OpenMP terminology, the collection of threads (the original thread plus the newly spawned threads) that executes the parallel block is called a team
  • The original thread is called the master thread, and the newly created threads are called slave threads

Implicit barrier: a thread that completes the code block waits until all the other threads in the team have finished before returning

Private stack : each thread has its own stack, so a thread executing the Hello function will create its own private local variables in the function

Out-of-order output: since standard output is shared by all threads, each thread can execute the printf statement and print its thread number and the thread count. Because access to standard output is not scheduled, the order in which the threads print their results is nondeterministic

error checking

  • If the compiler does not support OpenMP, including the header file omp.h and calling
    omp_get_thread_num and omp_get_num_threads will cause errors
  • A compiler that does not support OpenMP will simply ignore the parallel directive
  • To deal with this, you can check the preprocessor macro
    _OPENMP: if it is defined, omp.h can be included and the OpenMP functions can be called

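A minimal sketch of this check (the fallback values are assumptions for the case where OpenMP is unavailable):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
    int my_rank = omp_get_thread_num();        /* OpenMP available: query real values   */
    int thread_count = omp_get_num_threads();
#else
    int my_rank = 0;                           /* OpenMP unavailable: assume one thread */
    int thread_count = 1;
#endif
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
    return 0;
}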

variable scope

  • Shared scope: a variable that can be accessed by all the threads in the team.

  • Variables declared in the main function (a, b, n, global_result and thread_count) are accessible
    to all the threads in the team started by the parallel directive

    • The default scope of variables declared before a parallel block is shared.
      • Every thread can access a, b and n (in the call to Trap);
      • In the Trap function, global_result_p itself is a private variable (a function parameter),
        but it points to global_result, which has shared scope;
        therefore *global_result_p refers to a shared variable.
  • Private scope : Variables that can only be accessed by a single thread.

    • Variables declared in the parallel block have private scope (local variables in the function)

In the parallel directive, the default scope of all variables is shared

In parallel for, the default scope of the loop variable is private within the loop it parallelizes .
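A minimal sketch of these default scoping rules (the variable names are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 4;                 /* declared before the parallel block: shared by default */
    #pragma omp parallel num_threads(n)
    {
        int my_rank = omp_get_thread_num();    /* declared inside the block: private */
        printf("thread %d sees shared n = %d\n", my_rank, n);
    }
    return 0;
}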


  • Using "Fork/Join" mode, parallel execution mode

insert image description here

OpenMP programming

  • OpenMP is a "directive-based" shared-memory API.

  • The header file of OpenMP is omp.h

  • omp.h declares a library of functions and macros

  • The stdlib.h header file is the C standard library header.

    • stdlib.h defines several types, some macros and general utility functions (such as strtol)

preprocessing directive

  • Preprocessor directives are used in C and C++ to enable behavior that is not part of the basic C language specification;
    • e.g. the special preprocessor directive #pragma
    • Compilers that do not support a given pragma simply ignore
      the statements hinted at by that pragma directive.
    • This allows programs that use pragmas to run on platforms that do not support them.

# pragma omp parallel:
The parallel directive indicates that the structured block that follows should be executed in parallel by multiple threads

A structured block is a single C statement or a compound C statement with exactly one entry point and one exit point

Formatting rules:

  1. Preprocessor directives start with #pragma
  2. The # is placed in the first column
  3. The rest of the pragma is aligned with the other code
  4. By default a pragma occupies a single line
  5. If a pragma does not fit on one line, the continuation line must be escaped
    with '\'

OpenMP rules

In OpenMP, preprocessor directives start with #pragma omp

# pragma omp parallel clause1 \
clause2 … clauseN new-line
  • A directive line consists of a directive and a clause list

#pragma omp parallel [clause…] new-line
Structured-block

#pragma omp parallel private(i,j)
Here parallel is the directive and private is a clause; the clauses that follow a
directive are optional.

A new-line (line break) is required after the directive line, followed by a structured block

  • structured-block: a for loop or a pair of curly braces (and all the code inside them)

When the whole directive line is long, it can also be split across several lines,
joined with the C/C++ continuation character "\". For example:

#pragma omp parallel clause1 \
	clause2 … clauseN new-line
Structured-block

The reduction clause

  • A reduction operator is a binary operation (such as addition or multiplication).

  • A reduction applies the same operator repeatedly to a sequence of operands to obtain a single result. All intermediate and final results are stored in the reduction variable

The reduction is specified with the reduction clause:
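For example, a minimal sketch of reduction(+: sum) on a loop (the series being summed and the thread count are just illustrations):

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000;
    double sum = 0.0;

    /* Each thread accumulates into its own private copy of sum; the copies
       are combined with + into the shared sum when the loop finishes. */
    #pragma omp parallel for num_threads(4) reduction(+: sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;

    printf("sum = %f\n", sum);
    return 0;
}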


for & parallel for

The for directive distributes the iterations of a for loop among the threads for execution

How to use the for directive:

1. on its own inside the parallel block of a parallel directive;
2. combined with the parallel directive to form the parallel for directive

Note on the for and parallel for directives:

 parallel for forks a team of threads to execute the structured block that follows
 only the for statement itself can be parallelized

Example of parallel combined with for:

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]){
	int j=0;
	#pragma omp parallel num_threads(4)
	{
	#pragma omp for
	 for(j=0;j<4;j++)
	 printf("j=%d,ThreadId=%d\n",j,omp_get_thread_num());
	}
	return 0;
}

If the for directive is not enclosed in a parallel directive, the output shows that all four iterations are executed by a single thread; the for directive should therefore be used together with the parallel directive.

  • Multiple for directives in one parallel region
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]){
	int j=0;
	#pragma omp parallel num_threads(4)
	{
	#pragma omp for
	 for(j=0;j<4;j++)
	  printf("First:j=%d,ThreadId=%d\n",j,omp_get_thread_num());
	#pragma omp for
	 for(j=0;j<4;j++)
		printf("Second:j=%d,ThreadId=%d\n",j,omp_get_thread_num());
	}
	return 0;
}

Limitations of the for loop

(figure: canonical loop form — for (index = start; index < end; index += incr))

Some limitations on the form of the for loop are as follows:

The variable index must have integer or pointer type.
The expressions start, end and incr must have compatible types.
The expressions start, end and incr must not change during execution of the loop, i.e.
the number of iterations must be determinable before the loop runs.
During execution of the loop, the variable index can only be modified by the
"increment expression" in the for statement.
The loop body cannot contain a break statement.

private clause

A variable that appears in a reduction clause cannot also appear in a private clause

The value of a variable declared private is unspecified at the start of the parallel block or parallel for block, and is also unspecified after the block completes

  • demo: the variable j declared before the for loop and the variable j used inside the parallel region are in fact two different variables
  • the private copy does not inherit the value of the shared variable of the same name (see the sketch below)

default clause

  • The default clause lets the user control the sharing attributes of variables in the parallel region.

Usage: default(shared | none)

  • With shared, variables of the same name from outside the parallel region are by default
    treated as shared variables and no thread-private copies are created, unless
    clauses such as private are used to declare particular variables private.
  • With none, every variable used in the region must have its scope explicitly
    specified as shared or private, except variables whose scope is predetermined (such as the loop variable of a parallel for)


shared clause

Declares one or more variables as shared variables.
Usage: shared(list)

Please be aware:

When a shared variable is written inside a parallel region, access to it must be protected;
in general, do not use shared variables casually, and where possible convert accesses to shared variables into accesses to private variables

Concept summary


  • In a parallel for, the loop variable is private by default; this predetermined scope takes precedence over the setting given by default(none) or default(shared)

example code

for loop serial code:

#include <stdio.h>

int main(int argc, char *argv[]){
	 int i;
	 for (i=0; i<10; i++){
	 	printf("i=%d\n",i);
	 }
	 printf("Finished.\n");
	 return 0;
}

Parallel code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char *argv[]){
	 int i;
	 /* thread_count specifies the number of threads that execute the following block */
	 int thread_count = strtol(argv[1],NULL,10);
	# pragma omp parallel for num_threads(thread_count)
	 for (i=0; i<10; i++){
	 	printf("i=%d\n",i);
	 }
	 printf("Finished.\n");
	 return 0;
}

compile: gcc -g -Wall -fopenmp -o omp_example omp_example.c
execute: ./omp_example 4    (the loop is executed by 4 threads)


HelloWorld code


  • Every started thread executes the Hello function, and when a thread returns from the Hello call it terminates.
    • Threads that finish first block and wait at the implicit barrier
  • The process terminates when the main thread executes the return statement (a sketch of such a program follows)
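A minimal sketch of the kind of Hello program described above (assuming the thread count is passed on the command line):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void) {
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);   /* number of threads from the command line */

    #pragma omp parallel num_threads(thread_count)
    Hello();       /* every thread runs Hello; implicit barrier at the end of the block */

    return 0;
}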

trapezoidal rule integration

serial code


  • In the Trap function, each thread obtains its rank and the total number of threads, and then determines:
    1. the length of each trapezoid's base
    2. the number of trapezoids assigned to it
    3. the left and right endpoints of its subinterval
    4. its partial sum my_result
    5. and adds the partial sum to the global result global_result


  • Assign tasks to threads

Parallel version:
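A sketch of the parallel trapezoidal rule following the five steps above (the integrand f, the interval [a, b] and the number of trapezoids n are assumptions; the update of global_result is protected with the critical directive discussed in the next section):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double f(double x) { return x * x; }         /* example integrand (assumption) */

void Trap(double a, double b, int n, double* global_result_p) {
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();

    double h = (b - a) / n;                   /* 1. base length of each trapezoid  */
    int local_n = n / thread_count;           /* 2. trapezoids per thread (assumes
                                                    thread_count evenly divides n) */
    double local_a = a + my_rank * local_n * h;   /* 3. left endpoint              */
    double local_b = local_a + local_n * h;       /*    right endpoint             */

    double my_result = (f(local_a) + f(local_b)) / 2.0;   /* 4. partial sum        */
    for (int i = 1; i <= local_n - 1; i++)
        my_result += f(local_a + i * h);
    my_result = my_result * h;

    #pragma omp critical                      /* 5. add partial sum to global result */
    *global_result_p += my_result;
}

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);
    double a = 0.0, b = 1.0;                  /* integration interval (assumption) */
    int n = 1024;                             /* number of trapezoids (assumption) */
    double global_result = 0.0;

    #pragma omp parallel num_threads(thread_count)
    Trap(a, b, n, &global_result);

    printf("Estimate of integral from %f to %f = %f\n", a, b, global_result);
    return 0;
}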

critical section

  • Code in which a race condition can occur is called a critical section
    • Race condition: multiple threads attempt to access and update a shared resource at the same time, and the result depends on the order of the accesses.
    • Therefore some mechanism is needed to ensure that only one thread executes the critical section at a time; this requirement is called mutual exclusion.

Unexpected results may occur when two or more threads add their results to global_result at the same time

critical:
The critical directive is placed before a critical section of code. Only one thread at a time may execute the critical section; other threads that want to enter it must wait their turn.

#pragma omp critical [(name)] new-line
	structured-block

	# pragma omp critical
	global_result += my_result;


π estimate

  • serial version: sum the terms of the series in a loop
  • parallel version: parallelize the loop with a reduction clause (a sketch follows)
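A minimal sketch of such an estimate, assuming it uses the alternating series π = 4·(1 − 1/3 + 1/5 − 1/7 + ⋯); note that the sign variable factor must be private and recomputed from the loop index in every iteration:

#include <stdio.h>
#include <omp.h>

int main(void) {
    long long n = 10000000;          /* number of terms (assumption) */
    double factor = 1.0;
    double sum = 0.0;

    #pragma omp parallel for num_threads(4) reduction(+: sum) private(factor)
    for (long long k = 0; k < n; k++) {
        factor = (k % 2 == 0) ? 1.0 : -1.0;   /* recompute from k; do not carry over */
        sum += factor / (2 * k + 1);
    }
    printf("pi estimate = %.10f\n", 4.0 * sum);
    return 0;
}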

Odd-even transposition sort

Three topics: implementing odd-even sort in OpenMP, loop scheduling, environment variables

Sorting:

Odd-even transposition sort is a variant of bubble sort that is better suited to parallelization. It alternates between even phases, which compare and swap the pairs (a[0],a[1]), (a[2],a[3]), …, and odd phases, which compare and swap the pairs (a[1],a[2]), (a[3],a[4]), …

  • Serial function implementation

  • OpenMP parallel implementation (version 1): use two parallel for directives to handle the even and odd phases; a new team of threads is forked and joined in every phase

  • OpenMP parallel implementation (version 2): use one parallel directive around the phase loop and two for directives inside it, so the same team of threads is reused across phases (see the sketch below)

double omp_get_wtime(void): returns the elapsed wall-clock time, in seconds, measured from some fixed reference point; the reference point stays the same for the whole run of the program, so the difference between two calls gives elapsed time

  • Comparing the two versions: time each with omp_get_wtime()

Loop scheduling

In OpenMP, task scheduling is mainly used when parallelizing for loops. When the iterations of the loop involve unequal amounts of computation, simply assigning the same number of consecutive iterations to every thread leaves the computational load unbalanced: some threads finish early and others late, some CPU cores sit idle, and program performance suffers.

  • A better approach is to assign work to the threads in a round-robin (cyclic) fashion

The schedule clause

In OpenMP, task scheduling for a parallelized for loop is controlled with the schedule clause.

schedule(<type> [, <chunksize>])

Values of type:

  • static: the iterations are assigned to the threads before the loop is executed, in blocks handed out in a round-robin fashion;

  • dynamic or guided: the iterations are assigned to the threads while the loop is executing

    • Iterations are handed out as the loop runs, so after a thread finishes its current set of iterations it can request more from the run-time system.
    • The iterations are divided into chunks of chunksize consecutive iterations (default 1)
    • With guided, each thread executes a chunk and requests another when it finishes; the size of the new chunks decreases over time
  • runtime: the schedule is decided at run time

    • When schedule(runtime) is specified, the system uses the environment variable
    • OMP_SCHEDULE at run time to decide how to schedule the loop.
    • export OMP_SCHEDULE="static,1"
    • OMP_SCHEDULE may take any value that can be used with static, dynamic or guided scheduling.
    • When type is runtime, a chunksize argument is illegal.
  • auto: the compiler and the run-time system determine the schedule

chunksize is a positive integer: the number of iterations in each chunk, i.e. how many iterations are handed to a thread at a time


  • Scheduling overhead: guided > dynamic > static
    • If the default schedule gives poor performance, experiment to find the best schedule type and chunksize (see the sketch below)
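A minimal sketch of the schedule clause on a loop with unbalanced iterations (the work function f and the chunk choice are illustrative):

#include <stdio.h>
#include <omp.h>

double f(int i) {
    double sum = 0.0;
    for (int j = 0; j < i; j++)        /* later iterations do more work */
        sum += j * 0.001;
    return sum;
}

int main(void) {
    const int n = 1000;
    double total = 0.0;

    /* schedule(static, 1): cyclic assignment, which balances the load here */
    #pragma omp parallel for num_threads(4) reduction(+: total) schedule(static, 1)
    for (int i = 0; i < n; i++)
        total += f(i);

    printf("total = %f\n", total);
    return 0;
}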

environment variable

Definition: environment variables are named values that the run-time system can access, i.e. they are available in the program's environment.

  • Common environment variables on Linux systems: PATH, HOME, SHELL

Use the export command to set an environment variable's value, e.g. export OMP_DYNAMIC=TRUE

There are four main environment variables for OpenMP:

  • OMP_DYNAMIC

    • FALSE: the number of threads is set by the function omp_set_num_threads() or the num_threads clause;
    • TRUE: the run-time system adjusts the number of threads according to factors such as available system resources. In general, creating as many threads as there are CPUs makes the best use of resources.
  • OMP_NUM_THREADS

    • The OMP_NUM_THREADS environment variable sets the default number of threads for parallel regions.
    • OMP_NUM_THREADS only takes effect when OMP_DYNAMIC is FALSE.

Ways to specify the number of threads:
1. Do not specify it (defaults to the number of processor cores);
2. Call the library function omp_set_num_threads(int num);
3. Use the num_threads clause: #pragma omp parallel num_threads(num);
4. Use the environment variable:
 $ gcc -fopenmp -o 1 1.c
 $ export OMP_NUM_THREADS=4    # set the environment variable's value
  • OMP_NESTED

    • When OMP_NESTED is TRUE, nested parallelism is enabled
    • Nested parallelism can be turned off by calling omp_set_nested() with argument 0
  • OMP_SCHEDULE

    • Used to set the schedule type; it only takes effect when the argument of the schedule clause is runtime

internal control variables

Clauses:

schedule: task scheduling; controls how the loop iterations are divided among the threads

Functions:

omp_get_wtime(): measures the time spent by an OpenMP parallel program
omp_set_num_threads(): sets the number of threads

OpenMP implements producer & consumer

The producer/consumer model has a buffer that acts as a warehouse: producers put products into the warehouse, and consumers take products out of it.

  • Three kinds of relationships: consumer–consumer, producer–producer, and producer–consumer;
  • Two kinds of roles: producers and consumers;
  • One trading place: the warehouse where producers and consumers exchange data. The warehouse acts as a buffer: producers put data into the buffer, and consumers take data out of it

queue

  • A queue is an abstract data structure. Elements are inserted at the tail of the queue, and when an element is read, the element at the head of the queue is returned and removed (first in, first out).
    The thread that enqueues an element acts as a producer, and the thread that dequeues an element acts as a consumer

messaging

Another application of the producer/consumer model: multithreaded message passing on a shared-memory system.

  • Each thread has a shared message queue; when a thread wants to send a message to another thread, it puts the message into the target thread's message queue
  • When a thread receives a message, it simply takes the message from the head of its own message queue

  • Each thread alternates between sending and receiving messages, and the user specifies the number of messages each thread sends

  • When a thread has sent all of its messages, it keeps receiving messages until all the other threads have finished, at which point all the threads terminate

Send a message

Note: accessing a message queue in order to enqueue a message may be a critical section

  • Enqueuing a message requires a variable that keeps track of the tail of the queue
    • For example, if a singly linked list is used to implement the message queue and the tail of the list corresponds to the tail of the queue, then for efficient enqueuing a pointer to the end of the list must be stored
  • When a new message is enqueued, the tail pointer has to be examined and updated
  • If two threads try to update the tail pointer at the same time, a message that one of the threads has enqueued may be lost

receive message

The critical-section problem for receiving a message is somewhat different from that for sending one.

  • Only the owner of a message queue (i.e. the destination thread) dequeues messages from it.
  • If there are at least two messages in the queue, then as long as only one message is dequeued at a time, a dequeue operation cannot conflict with an enqueue operation. So the critical-section problem can be avoided by checking that the queue holds at least two messages, which requires keeping track of the queue size

  • How should the queue size queue_size be stored or computed?

If a single variable is used to store the queue size, updates to that variable form critical sections (they conflict).

  • Solution:

    Use two variables, enqueued and dequeued; the number of messages in the queue
    (the queue size) is queue_size = enqueued − dequeued

    • The only thread that updates dequeued is the owner of the message queue.
    • For enqueued, one thread may read it to compute the queue size while another thread updates it; the computed size can be slightly stale, but since enqueued only grows, the owner never overestimates how many messages it can safely dequeue.

Termination check

How does a receiving thread know when it can stop receiving, i.e. when no more messages will arrive?

  • Solution:
    In our program, a thread sends no further messages after it finishes executing its for loop.
    A counter done_sending can be used: each thread increments it by 1 after finishing its for loop. Done can then be implemented as in the sketch below.
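A sketch of such a Done check, assuming queue_size is computed as enqueued − dequeued and done_sending counts the threads that have finished their send loops:

int Done(int enqueued, int dequeued, int done_sending, int thread_count) {
    int queue_size = enqueued - dequeued;
    if (queue_size == 0 && done_sending == thread_count)
        return 1;   /* queue is empty and every thread has finished sending */
    else
        return 0;
}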

start up

  • When the program starts, the main thread reads the command-line arguments and allocates an array of message queues, one queue per thread.
  • Since each thread can send messages to any other thread, i.e. insert a message into any message queue, the queue array must be shared by all threads.

queue allocation problem

Problem: one or more threads may finish allocating their queues before the other threads. A thread that has finished allocating may then try to start sending messages to threads that have not yet finished allocating their queues, and the program will crash.

It must be guaranteed that no thread starts sending messages until every thread has finished allocating its queue.

Solution: an explicit barrier is needed: when a thread reaches the barrier, it blocks until all the threads in the team have reached the barrier.

  • Implicit barrier: at the end of an OpenMP construct.
  • Explicit barrier: placed in the middle of an OpenMP construct, e.g. in the middle of a parallel block.

  • Blocking: the threads in the team wait at the barrier, and only after all of them have reached it do they continue executing (see the sketch below).
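A minimal, self-contained sketch of an explicit barrier: no thread touches another thread's "queue" until every thread has allocated its own (the int arrays stand in for the real message queues):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    int thread_count = 4;
    int** queues = malloc(thread_count * sizeof(int*));   /* one "queue" per thread */

    #pragma omp parallel num_threads(thread_count)
    {
        int my_rank = omp_get_thread_num();
        queues[my_rank] = calloc(8, sizeof(int));          /* allocate my own queue */

        #pragma omp barrier    /* explicit barrier: all allocations finish before any access */

        int dest = (my_rank + 1) % thread_count;
        queues[dest][0] = my_rank;                         /* "send" a message to another thread */
    }

    for (int q = 0; q < thread_count; q++)
        printf("queue %d head message: %d\n", q, queues[q][0]);
    return 0;
}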

atomic

Used, like the critical directive, to protect critical sections.

atomic directive: it can protect only a critical section consisting of a single C assignment statement: #pragma omp atomic

Idea:

Many processors provide special load-modify-store instructions.

Using such a specialized instruction, rather than the general-purpose mechanism for protecting arbitrary critical sections, protects the critical section more efficiently (see the sketch below).
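A minimal sketch of the atomic directive protecting a single assignment-style update:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int count = 0;
    #pragma omp parallel num_threads(4)
    {
        #pragma omp atomic        /* only this single update statement is protected */
        count += 1;
    }
    printf("count = %d\n", count);   /* always 4 */
    return 0;
}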

The structure of a message queue:

1. the list of messages
2. a pointer to the tail of the queue
3. a pointer to the head of the queue
4. the number of enqueued messages
5. the number of dequeued messages

To reduce copying overhead when passing parameters, it is best to implement the message queues as an array of pointers to structures.

An array whose elements are all of pointer type is called a pointer array; each element of a pointer array behaves like a pointer variable.
Declaration: type *array_name[length]; for example int *p[4];
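A sketch of such a structure (the field names are illustrative, not a fixed API):

#include <omp.h>

struct queue_node_s {           /* one message in the list              */
    int src;                    /* rank of the sending thread           */
    int mesg;                   /* the message itself                   */
    struct queue_node_s* next;
};

struct queue_s {
    omp_lock_t lock;            /* lock protecting the queue (optional) */
    int enqueued;               /* number of messages ever enqueued     */
    int dequeued;               /* number of messages ever dequeued     */
    struct queue_node_s* front; /* head of the message list             */
    struct queue_node_s* tail;  /* tail of the message list             */
};

/* queue size = enqueued - dequeued; an array of pointers to these structures,
   one per thread, gives each thread its own message queue. */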


conditional compilation

As in the error-checking section above, code that uses OpenMP headers and functions can be guarded with #ifdef _OPENMP so that it still compiles when the compiler does not support OpenMP.

Lock

  • Lock mechanism: used when mutual exclusion is required for a data structure rather than for a block of code.

Directives:

 atomic directive: the fastest way to obtain mutual exclusion (but only for a single assignment statement).
 critical directive: protects a general critical section and provides mutual exclusion.

Functions:

 void omp_init_lock(omp_lock_t* lock);    // initialize the lock
 void omp_destroy_lock(omp_lock_t* lock); // destroy the lock
 void omp_set_lock(omp_lock_t* lock);     // acquire the lock (blocks until it is available)
 void omp_unset_lock(omp_lock_t* lock);   // release the lock
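A minimal sketch of these lock functions guarding a shared update:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int count = 0;

    omp_init_lock(&lock);               /* initialize the lock */
    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);            /* acquire (blocks until available) */
        count++;                        /* protected update of shared data  */
        omp_unset_lock(&lock);          /* release                          */
    }
    omp_destroy_lock(&lock);            /* destroy the lock */

    printf("count = %d\n", count);      /* always 4 */
    return 0;
}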

OpenMP compilation command

$gcc -g -Wall -fopenmp -o outputdir filename.c
$./outputdir


Origin blog.csdn.net/RandyHan/article/details/129308823