Parallel and Distributed Computing: MPI Advanced (7)
Section 7 MPI Collective Communication
To make it easier to communicate with readers, in this and later articles I will use the label "not interested" for readers who want everything taken care of for them (for example, you are taking Teacher Luo's course and are now doing the homework), do not care about unimportant things (for example, things Teacher Luo did not teach and will not examine), and only want a rough understanding.
"Not interested" is intended for P college students who are busy with their studies. If you are interested in the material and want to build a solid foundation, then you should be "all interested" in everything in these articles.
7.1 Overview
This part takes matrix-vector multiplication as an example, and introduces the following parts
7.1.1 Four communication functions
- MPI_Allgatherv: all-gather operation; different processes may contribute different numbers of elements
- MPI_Scatterv: scatter operation; different processes may receive different numbers of elements
- MPI_Gatherv: gather operation; the number of elements collected from different processes may differ
- MPI_Alltoallv: all-to-all exchange operation; exchanges data elements among all processes
7.1.2 Five communication domain functions
- MPI_Dims_create: computes a balanced division of processes across the dimensions of a Cartesian process grid
- MPI_Cart_create: Create a Cartesian topology communication domain
- MPI_Cart_coords: Returns the coordinates of a process in the Cartesian process grid
- MPI_Cart_rank: Returns the process number of the process at a certain coordinate in the Cartesian process grid
- MPI_Comm_split: partitions the processes of a communicator into one or more groups
7.1.3 Collective Communication
- one-to-all broadcast, all-to-one reduction
- all-to-all broadcast, all-to-all reduction
- gather, scatter
- all-to-all exchange
7.2 Some notes
The problem we study is computing Ab = c, where A is an m × n matrix, b is an n-dimensional column vector, and c is an m-dimensional column vector. For convenience we assume m = n. For b and c, the extra storage cost of replicating them does not change the order of magnitude of the space complexity, so we simply copy them to every process; the decompositions below therefore focus on decomposing the matrix.
In order to avoid I/O confusion, we always use only one process to read data, and only one process to output results.
7.3 Break down by line
Viewed by rows, the result of matrix-vector multiplication is a set of dot products of n-dimensional vectors, one for each row of A (with m = n, that is n dot products), so the motivation for decomposing the multiplication by rows is very obvious.
When we properly decompose the matrix by rows and perform appropriate aggregation and mapping, what we actually need to do is to collect the results of each process and put them together to become the total result.
The matrix decomposed by rows is easy to read and send: since the data is read row by row anyway, each row block lies in contiguous memory and only needs to be sent to the designated process.
MPI_Allgatherv
An all-gather communication concatenates the vector blocks distributed among a group of processes and copies the result to all processes.
If the same number of elements were collected from every process, the simpler MPI_Allgather function would be suitable. But we have no way to guarantee that the tasks assigned to each process are equal, so we use MPI_Allgatherv.
The function declaration is as follows
int MPI_Allgatherv(
    void*        send_buffer,    // starting address of the data this process sends
    int          send_cnt,       // number of elements this process sends
    MPI_Datatype send_type,      // datatype of the sent elements
    void*        receive_buffer, // starting address of the buffer for the gathered elements
    int*         receive_cnt,    // array: element i is the number of elements received from process i
    int*         receive_disp,   // array: element i is the offset in receive_buffer of the data from process i
    MPI_Datatype receive_type,   // datatype of the received elements
    MPI_Comm     communicator    // communicator in which this operation takes place
);
In other words, the result contributed by the i-th process occupies the array elements receive_buffer[receive_disp[i]] through receive_buffer[receive_disp[i] + receive_cnt[i] - 1].
MPI_Gatherv
The purpose and parameters of this function are similar to MPI_Allgatherv, except that it gathers the data from all processes into a single (root) process.
I won't repeat it here
7.4 Break down by column
From the perspective of linear spaces, we can also view Ab as a linear combination of the columns of A with the entries of b as coefficients: writing A = (α_1, ..., α_n) and b^T = (b_1, ..., b_n), we have Ab = ∑_{i=1}^{n} b_i α_i.
Therefore the motivation for the decomposition by columns is equally obvious: task i scales its column vector α_i by the factor b_i, and a reduction (summation) is performed at the end.
MPI_Scatterv
When decomposing by columns, sending the data is clearly no longer as simple as before: each row must be segmented, and its segments sent to different processes.
The function declaration is as follows
int MPI_Scatterv(
    void*        send_buffer, // starting address of the data this (root) process sends
    int*         send_cnt,    // array: element i is the number of elements sent to process i
    int*         send_disp,   // array: element i is the offset in send_buffer of the data sent to process i
    MPI_Datatype send_type,   // datatype of the sent elements
    void*        recv_buffer, // buffer in which this process stores the received elements
    int          recv_cnt,    // number of elements this process receives
    MPI_Datatype recv_type,   // datatype of the received elements
    int          root,        // rank of the process that distributes the data
    MPI_Comm     communicator // communicator in which this operation takes place
);
MPI_Scatterv is a collective communication function: all processes in the communicator participate in its execution. The function requires every process to initialize two arrays, one giving the number of elements the root process sends to each process and the other giving the offset of each process's data in the send buffer. The scatter proceeds in rank order: process 0 gets the first block, process 1 the second block, and so on.
MPI_Alltoallv
As we said at the beginning, after the j-th task has computed its n products, if it also wants to compute c[j], it must keep one of its values and obtain the remaining n-1 values from the other tasks. Every task must send out n-1 values and collect the n-1 results it needs; this is called an all-to-all exchange.
MPI_Alltoallv can complete the exchange of data between all processes in a communication domain
The declaration of the function is as follows
int MPI_Alltoallv(
    void*        send_buffer,       // starting address of the array to exchange
    int*         send_count,        // element i is the number of elements sent to process i
    int*         send_displacement, // element i is the offset in send_buffer of the data sent to process i
    MPI_Datatype send_type,         // datatype of the sent elements
    void*        recv_buffer,       // starting address of the buffer for received data (including data sent to itself)
    int*         recv_count,        // element i is the number of elements this process receives from process i
    int*         recv_displacement, // element i is the offset in recv_buffer of the data received from process i
    MPI_Datatype recv_type,         // datatype of the received elements
    MPI_Comm     communicator       // communicator in which this operation takes place
);
7.5 Checkerboard decomposition
Starting from block multiplication of matrices, A can be decomposed into p blocks in a checkerboard fashion, with the computation of each block assigned to one process.
How to send and receive the data of this matrix decomposition is worth thinking about, but it is not our core concern. So we abstract the problem into a checkerboard-like process grid: the processes in the first column take part in gathering the data d (in this example, assume b is cut into k segments, one given to each process in the first column; although this contradicts our initial assumption of copying b directly to every process, I state it this way only to pose the problem); the processes in the first row take part in distributing the data d (each to the processes in its own column); all processes in each row then perform an independent sum reduction, which produces a vector in the first-column processes; and finally this vector is gathered into the process in the first row and first column (called process 0 below).
So what we actually have to do is the following. Step one: all processes in the first column gather their data to process 0 (an all-to-one reduction in the sense of the first column). Step two: process 0 distributes the information to the processes in the first row (a scatter in the sense of the first row). Step three: each process in the first row copies the information to the processes in its own column (a one-to-all broadcast in the sense of each column). Notice that each of these operations actually takes place in a smaller communication domain; in other words, the processes are divided into several communication groups. Intuitively, communication within a small group is far more efficient than communication among a subset of a larger group. This is what we discuss next.
We next introduce the concept of the communication domain on the basis of this type of problem. As for computing the concrete results and the subsequent reductions and transmissions, they are similar to the above and will not be repeated.
Communication domain
A communication domain consists of a process group, context, and other attributes.
Process topology is an important feature of the communication domain.
- Topology can establish a new addressing mode for processes, not just using process numbers
- The topology is virtual, which means it does not depend on the actual connection of the processor
- MPI supports two topologies: Cartesian topology and graph topology
MPI_Dims_create
To give the matrix-vector product algorithm the best scalability, the virtual process grid we establish should be close to a square (the reason need not be studied for now; you can verify it yourself after learning about scalability), so we use a Cartesian topology (grid topology).
We only need to pass the number of nodes and the number of dimensions of the Cartesian grid to this function, and it returns the number of nodes along each dimension in size. If we have special requirements for the grid, we can also specify them manually in size.
int MPI_Dims_create(
    int  nodes, // number of processes in the grid
    int  dims,  // number of grid dimensions we want
    int* size   // size of each dimension; if size[i] is 0, the function decides that dimension's size
);
In particular, if dims = 2 and size is all zeros, then size[0] and size[1] will hold the number of rows and columns of the grid, respectively (a somewhat puzzling convention).
MPI_Cart_create
After determining the size of each dimension of the virtual network, it is necessary to establish a communication domain for this topology. The group function MPI_Cart_create can accomplish this task, and its declaration is as follows:
int MPI_Cart_create(
    MPI_Comm  old_comm,  // old communicator; every process in it must call this function
    int       dims,      // number of grid dimensions
    int*      size,      // array of length dims; size[j] is the number of processes in dimension j
    int*      periodic,  // array of length dims; periodic[j] = 1 if dimension j wraps around, 0 otherwise
    int       reorder,   // whether processes may be renumbered; if 0, each process keeps its old rank in the new communicator
    MPI_Comm* cart_comm  // on return, the new Cartesian communicator (output parameter)
);
MPI_Cart_rank
The function of this function is to obtain a process's rank from its coordinates in the grid
MPI_Cart_coords
The function of this function is to determine the coordinates of a process in the virtual grid
MPI_Comm_split
Divide a communication domain into several groups
7.6 Collective Communication
Group communication is an important concept in parallel computing
7.6.1 Topological structure of communication network
Before talking about group communication, we must make it clear what kind of process network we are in for group communication.
What we call the communication network here is entirely about the network's topology, that is, which processors are connected to which in our parallel architecture, regardless of whether they are physically close to each other.
concept
Many common concepts are given below. "Not interested" students do not need to read them, but in that case the analysis in the next section should be skipped as well (it cannot be understood without them).
- Direct topology: a switch corresponds to a processor node, and a switch node is connected to a processor node and one or more other switches
- Indirect topology: a processor node is connected to multiple switches, some switches can only be connected to other switches
- Diameter: the maximum distance between two switch nodes. The diameter determines a lower bound on the complexity of parallel algorithms that communicate between arbitrary pairs of nodes, so a smaller diameter is better.
- Bisection bandwidth: the minimum number of edges that must be removed to divide the network into two halves. The larger the bisection bandwidth, the better: in algorithms that move a large amount of data, the data-set size divided by the bisection bandwidth is a lower bound on the parallel algorithm's complexity.
- The number of edges of each switch node: It is best that the number of edges of each switch node has nothing to do with the network size, so that it is easier to expand
- Fixed side length: For scalability reasons, the best case is that the nodes and edges of the network can be arranged in a three-dimensional space, so that the maximum side length is a constant independent of the network size
Common topology
Many common topologies are given below. "Not interested" students only need to understand the ring structure (Ring) and the hypercube (Hypercube).
In the illustrations, circles represent switches and squares represent processors. A switch is a means of controlling processor communication to ensure that it is safe and reliable; the details can be found in the textbook, and "not interested" readers can skip them.
- Ring structure: as the name suggests, the processors are connected one after another into a closed loop.
- Two-dimensional mesh network (direct topology). Benefit: good scalability (the number of edges per switch is constant).
- Binary tree network (indirect topology). Benefit: small diameter. Drawback: small bisection bandwidth (1).
- Hypertree network (indirect topology). Benefits: smaller diameter and larger bisection bandwidth than the binary tree. A 4-ary hypertree network is superior to the binary tree network in almost every respect: few switch nodes, small diameter, and large bisection bandwidth.
- Butterfly network
- Hypercube network. Diameter: log n; bisection bandwidth: n/2. Routing: adjacent nodes differ in exactly one binary bit, so a path from source to destination is generated by flipping, one at a time, the bits in which they differ.
- Shuffle-exchange network: a compromise solution with a fixed number of edges per node, a fairly small diameter, and fairly good bisection bandwidth.
summary
7.6.2 Broadcast & Reduction
B&R operations are mainly divided into the following types
- one-to-all broadcast, all-to-one reduction
- all-to-all broadcast, all-to-all reduction
- gather, scatter
- all-to-all exchange (Personalized Communication)
Note that when discussing each operation, we must state the topology network it is currently running on.
One-to-all broadcast, All-to-one reduction
The basic effects of this communication are as follows
Take the ring topology as an example (the numbers on the dashed line represent the sequence of occurrence, and the arrows represent the data sent from the starting point to the end point)
All-to-all broadcast, All-to-all reduction
The effect of this communication is as follows
Take the ring topology as an example (Broadcast) as follows (the numbers on the dotted line represent the sequence of occurrence, and the arrows represent the data sent from the start point to the end point)
(The following is one of the simplest methods: using p one-to-all broadcasts easily achieves this effect, although it is not efficient.)
An example of a two-dimensional grid structure (Broadcast) is as follows (the broadcast is only halfway through, and it can be completed by another communication between columns)
Gather,scatter
Scatter bears a certain similarity to one-to-all broadcast (the source is unique and sends to all processes). In fact, scatter can be realized by having each node, during a one-to-all broadcast, intercept the message it needs and remove it from the content being passed on.
All-to-all exchange(All-to-all Personalized Communication)
Viewed from the left, M_{i,j} denotes the message residing on process i that is to be sent to process j (on the right-hand side, the opposite).