Introduction to Parallel and Distributed Computing (7) MPI Collective Communication


A fish's life is short, so try to be a little saltier!!!

Section 7 MPI Collective Communication

To make it easier to communicate with readers, in this article and the ones that follow I will use "not interested" to describe the mindset of wanting everything taken care of for you (for example, you are taking Teacher Luo's course and are just trying to finish the homework), not caring about the less important things (for example, things Teacher Luo did not teach and will not examine), and only wanting a rough understanding.

"Not interested" is meant for P University students who are busy with their studies. If you are interested in this material and want to lay a solid foundation, then you should be "all interested" in everything in these articles.

7.1 Overview

This part takes matrix-vector multiplication as an example and introduces the following topics.

7.1.1 Four communication functions

  • MPI_Allgatherv: data collection (all-gather) operation; different processes may contribute different numbers of elements
  • MPI_Scatterv: data distribution operation; different processes may receive different numbers of elements
  • MPI_Gatherv: data collection operation; the number of elements collected from each process may differ
  • MPI_Alltoallv: all-to-all exchange operation; exchanges data elements among all processes

7.1.2 Five communication domain functions

  • MPI_Dims_create: chooses the sizes of the dimensions of a balanced Cartesian process grid
  • MPI_Cart_create: Create a Cartesian topology communication domain
  • MPI_Cart_coords: Returns the coordinates of a process in the Cartesian process grid
  • MPI_Cart_rank: Returns the process number of the process at a certain coordinate in the Cartesian process grid
  • MPI_Comm_split: partitions the processes of a communication domain into one or more groups

7.1.3 Collective Communication

  • one-to-all broadcast,all-to-one reduction
  • all-to-all broadcast,all-to-all reduction
  • gather,scatter
  • all-to-all exchange

7.2 Some notes

The problem we study is to compute $Ab = c$, where $A$ is an $m \times n$ matrix, $b$ is an $n$-dimensional column vector, and $c$ is an $m$-dimensional column vector. For convenience we assume $m = n$. For $b$ and $c$, the extra storage needed to replicate them does not change the order of magnitude of the space complexity, so we simply copy them to every process; the decompositions below therefore focus on decomposing the matrix.

In order to avoid I/O confusion, we always use only one process to read data, and only one process to output results.

7.3 Break down by row

From the row perspective, the result of a matrix-vector product can be regarded as $n$ dot products of pairs of $n$-dimensional vectors (each row of $A$ with $b$), so the motivation for decomposing the computation by rows is obvious.
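Written out element by element (just the standard definition, for reference), each entry of $c$ is one dot product:

$$c_i = \sum_{j=1}^{n} a_{ij}\, b_j, \qquad i = 1, \dots, n.$$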

When we decompose the matrix properly by rows and perform a suitable agglomeration and mapping, what actually remains to be done is to collect the partial results from each process and concatenate them into the overall result.

Reading and sending a matrix decomposed by rows is easy: the file is read row by row anyway, so we only need to send contiguous blocks of memory to the designated processes.

MPI_Allgatherv

An all-gather communication concatenates the blocks of a vector distributed across a group of processes and copies the result to all processes.

If we collected the same number of elements from every process, the simpler MPI_Allgather function would be a perfect fit. But we cannot guarantee that the tasks assigned to the processes are equal in size, so we use MPI_Allgatherv.

[Figure: effect of MPI_Allgatherv]

The function declaration is as follows

int MPI_Allgatherv(
    void* send_buffer,         // starting address of the data this process sends
    int send_cnt,              // number of elements this process sends
    MPI_Datatype send_type,    // type of the data being sent
    void* receive_buffer,      // starting address of the buffer for the gathered elements
    int* receive_cnt,          // array: element i is the number of items received from process i
    int* receive_disp,         // array: element i is the offset in receive_buffer of the data from process i
    MPI_Datatype receive_type, // type of the data being received
    MPI_Comm communicator      // communication domain in which this operation takes place
);

In other words, the data contributed by the i-th process occupies the array elements receive_buffer[receive_disp[i]] through receive_buffer[receive_disp[i] + receive_cnt[i] - 1].
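To make this concrete, here is a minimal sketch (my own illustration, not code from the original post) of a row-decomposed matrix-vector product that ends with MPI_Allgatherv; the block-distribution arithmetic and all variable names are assumptions of the sketch.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: each process owns cnt[id] consecutive rows of an n x n matrix
   (row-major in a_local) and a full copy of b; after computing its partial
   results it calls MPI_Allgatherv so that every process ends up with the
   complete vector c. */
void row_decomposed_matvec(double *a_local, double *b, double *c,
                           int n, MPI_Comm comm)
{
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    /* Block distribution: the first n % p processes get one extra row. */
    int *cnt  = malloc(p * sizeof(int));
    int *disp = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) {
        cnt[i]  = n / p + (i < n % p ? 1 : 0);
        disp[i] = (i == 0) ? 0 : disp[i - 1] + cnt[i - 1];
    }

    /* One dot product per locally owned row. */
    double *c_local = malloc(cnt[id] * sizeof(double));
    for (int i = 0; i < cnt[id]; i++) {
        c_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            c_local[i] += a_local[i * n + j] * b[j];
    }

    /* Gather the pieces of c from all processes and replicate the result. */
    MPI_Allgatherv(c_local, cnt[id], MPI_DOUBLE,
                   c, cnt, disp, MPI_DOUBLE, comm);

    free(c_local); free(cnt); free(disp);
}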

MPI_Gatherv

The basic function and parameters are similar to MPI_Allgatherv, except that it collects the data from every process into a single process.

[Figure: effect of MPI_Gatherv]

I won't repeat it here
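For comparison, a hedged one-line sketch of gathering the same partial results only at process 0 (for example, for the single output process mentioned in 7.2); it reuses the illustrative variables from the MPI_Allgatherv sketch above.

/* Only the root (process 0) receives the assembled vector c here. */
MPI_Gatherv(c_local, cnt[id], MPI_DOUBLE,
            c, cnt, disp, MPI_DOUBLE,
            0, comm);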

7.4 Break down by column

From the perspective of linear spaces, we can also view $Ab$ as a linear combination of the columns of $A$ with the entries of $b$ as coefficients: writing $A = (\alpha_1, \dots, \alpha_n)$ and $b^T = (b_1, \dots, b_n)$, we have $Ab = \sum_{i=1}^{n} b_i \alpha_i$.

Therefore the motivation for decomposing by columns is also obvious: each task scales its column vector $\alpha_i$ by the scalar $b_i$, and a reduction (sum) is performed at the end.
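As a hedged sketch of the per-process work in this column decomposition (my own illustration): each process holds a block of consecutive columns a_cols together with the matching entries b_local of b, and here the partial vectors are simply summed with MPI_Allreduce for brevity, whereas the text below builds the same result with MPI_Scatterv and MPI_Alltoallv.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: a_cols stores local_cols columns of an n x n matrix column-major;
   b_local holds the corresponding entries of b. Each process forms its
   partial linear combination sum_k b_k * alpha_k, then the partials are
   summed across processes. */
void column_partial_matvec(const double *a_cols, const double *b_local,
                           double *c, int n, int local_cols, MPI_Comm comm)
{
    double *c_partial = calloc(n, sizeof(double));

    for (int k = 0; k < local_cols; k++)       /* c_partial += b_k * alpha_k */
        for (int i = 0; i < n; i++)
            c_partial[i] += b_local[k] * a_cols[k * n + i];

    /* Sum all partial vectors; every process receives the full c. */
    MPI_Allreduce(c_partial, c, n, MPI_DOUBLE, MPI_SUM, comm);

    free(c_partial);
}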

MPI_Scatterv

When decomposing by columns, sending the data is clearly not as simple as before: each row has to be cut into segments and the segments sent to different processes.

[Figure: effect of MPI_Scatterv]

The function declaration is as follows

int MPI_Scatterv(
    void* send_buffer,       // starting address of the data this (root) process sends
    int* send_cnt,           // array: element i is the number of items sent to process i
    int* send_disp,          // array: element i is the offset in send_buffer of the data sent to process i
    MPI_Datatype send_type,  // type of the data being sent
    void* recv_buffer,       // pointer to this process's buffer for the received elements
    int recv_cnt,            // number of elements this process receives
    MPI_Datatype recv_type,  // type of the data being received
    int root,                // rank of the process that distributes the data
    MPI_Comm communicator    // communication domain in which this operation takes place
);

MPI_Scatterv is a collective communication function: all processes in the communication domain take part in its execution. The function requires every process to initialize two arrays, one giving the number of items the root process sends to each process and the other giving the offset of each process's data within the send buffer. The distribution is performed in rank order: process 0 gets the first block, process 1 the second block, and so on.
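A minimal sketch of the idea described above (all names are my own, not the author's): the root process splits one row of length n into blocks and every process receives its own block.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: `row` (significant at the root) holds one full row of length n;
   each process receives its block of that row into `row_piece`. */
void scatter_one_row(double *row, double *row_piece,
                     int n, int root, MPI_Comm comm)
{
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    /* Every process builds the count/displacement arrays, as the text requires. */
    int *send_cnt  = malloc(p * sizeof(int));
    int *send_disp = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) {
        send_cnt[i]  = n / p + (i < n % p ? 1 : 0);
        send_disp[i] = (i == 0) ? 0 : send_disp[i - 1] + send_cnt[i - 1];
    }

    MPI_Scatterv(row, send_cnt, send_disp, MPI_DOUBLE,
                 row_piece, send_cnt[id], MPI_DOUBLE,
                 root, comm);

    free(send_cnt); free(send_disp);
}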

MPI_Alltoallv

As we said at the beginning, after the j-th task has computed its n products, if it also wants to compute c[j] it must keep one value for itself and obtain the remaining n - 1 values from the other tasks. Every task must therefore send out n - 1 values and collect the n - 1 results it needs; this is called an all-to-all exchange.

MPI_Alltoallv can complete the exchange of data between all processes in a communication domain

[Figure: effect of MPI_Alltoallv]

The declaration of the function is as follows

int MPI_Alltoallv(
    void* send_buffer,        // starting address of the array to be exchanged
    int* send_count,          // element i is the number of items sent to process i
    int* send_displacement,   // element i is the offset in send_buffer of the data sent to process i
    MPI_Datatype send_type,
    void* recv_buffer,        // starting address of the buffer for received data (including data this process sends to itself)
    int* recv_count,          // element i is the number of items this process receives from process i
    int* recv_displacement,   // element i is the offset in recv_buffer of the data received from process i
    MPI_Datatype recv_type,
    MPI_Comm communicator
);
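A hedged, self-contained sketch of an all-to-all exchange with uneven counts (purely illustrative; the data pattern and names are my own): process id sends dest + 1 copies of its rank to each destination dest, which is exactly the situation that requires the "v" variant.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *scnt = malloc(p * sizeof(int)), *sdsp = malloc(p * sizeof(int));
    int *rcnt = malloc(p * sizeof(int)), *rdsp = malloc(p * sizeof(int));
    int stotal = 0, rtotal = 0;
    for (int i = 0; i < p; i++) {
        scnt[i] = i + 1;            /* send i + 1 items to process i          */
        rcnt[i] = id + 1;           /* so we receive id + 1 items from each i */
        sdsp[i] = stotal;  stotal += scnt[i];
        rdsp[i] = rtotal;  rtotal += rcnt[i];
    }

    int *sbuf = malloc(stotal * sizeof(int));
    int *rbuf = malloc(rtotal * sizeof(int));
    for (int i = 0; i < stotal; i++) sbuf[i] = id;  /* payload: our own rank */

    MPI_Alltoallv(sbuf, scnt, sdsp, MPI_INT,
                  rbuf, rcnt, rdsp, MPI_INT, MPI_COMM_WORLD);

    printf("process %d received %d items in total\n", id, rtotal);

    free(sbuf); free(rbuf); free(scnt); free(sdsp); free(rcnt); free(rdsp);
    MPI_Finalize();
    return 0;
}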

7.5 Checkerboard decomposition

Starting from block multiplication of matrices, A can be decomposed into p blocks in a checkerboard fashion, and the computation for each block is assigned to a process.

How the pieces of the decomposed matrix are sent and received is worth thinking about, but it is not our core concern here. So we abstract the problem as a checkerboard-like process grid in which the first column takes part in collecting a piece of data d (in this example you can imagine that b is cut into k segments, one for each process in the first column, even though this contradicts our earlier assumption of copying b to every process; I put it this way only to pose the problem). The first row takes part in distributing d (each first-row process distributes it to the processes in its own column). All the processes in each row then perform an independent sum reduction, so that a vector is produced across the first-column processes, and this vector is finally gathered into the process in the first row and first column (call it process zero).

So what we actually have to do is the following: step one, all processes in the first column gather their data to process zero (an all-to-one gather within the first column); step two, process zero distributes the information to the processes in the first row (a scatter within the first row); step three, each first-row process copies the information to the processes in its column (a one-to-all broadcast within each column). Notice that each of these operations actually takes place in a smaller communication domain; in other words, the processes are divided into several communication groups. Intuitively, communication inside a small group is far more efficient than communication among part of a larger group, and that is what we discuss next.

On the basis of this kind of problem we next introduce the concept of the communication domain. The computation of the actual results and the subsequent reductions and transfers are similar to what we have already seen and will not be repeated.

Communication domain

A communication domain consists of a process group, context, and other attributes.

Process topology is an important feature of the communication domain.

  • Topology can establish a new addressing mode for processes, not just using process numbers
  • The topology is virtual, which means it does not depend on the actual connection of the processor
  • MPI supports two topologies: Cartesian topology and graph topology

MPI_Dims_create

For the matrix-vector product algorithm to have the best scalability, the virtual process grid we build should be as close to a square as possible (there is no need to study the reason for now; you can verify it yourself after learning about scalability), so we will use a Cartesian topology (mesh topology).

We only need to pass the number of nodes and the number of dimensions of the Cartesian grid to this function, and it returns the number of nodes in each dimension in size. If we have special requirements for the grid, we can also specify dimensions manually in size.

int MPI_Dims_create(
    int nodes,   // number of processes in the grid
    int dims,    // number of grid dimensions we want
    int* size    // size of each dimension; if size[i] is 0, the function decides that dimension's size
);

In particular, if dims = 2 and size is all zeros, size[0] and size[1] will hold the number of rows and columns of the grid respectively (a rather baffling convention, if you ask me).
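A minimal sketch of the typical call (for illustration only; the printed interpretation follows the row/column convention just described):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int p, size[2] = {0, 0};   /* 0 = let MPI_Dims_create choose this dimension */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Ask for a 2-D grid as close to square as possible, e.g. p = 12 -> 4 x 3. */
    MPI_Dims_create(p, 2, size);
    printf("grid: %d rows x %d columns\n", size[0], size[1]);

    MPI_Finalize();
    return 0;
}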

MPI_Cart_create

After determining the size of each dimension of the virtual grid, we need to create a communication domain with this topology. The collective function MPI_Cart_create accomplishes this; its declaration is as follows:

int MPI_Cart_create(
    MPI_Comm old_comm,  // the old communication domain; all processes in it must call this function
    int dims,           // number of grid dimensions
    int* size,          // array of length dims; size[j] is the number of processes in dimension j
    int* periodic,      // array of length dims; periodic[j] = 1 if dimension j is periodic (wraps around), 0 otherwise
    int reorder,        // whether processes may be renumbered; if 0, each process keeps its rank from the old domain
    MPI_Comm* cart_comm // on return, the handle of the new Cartesian communication domain
);

MPI_Cart_rank

This function obtains a process's rank from its coordinates in the grid.

MPI_Cart_coords

This function determines the coordinates of a process in the virtual grid.

MPI_Comm_split

Divide a communication domain into several groups
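Putting the five functions together, here is a hedged sketch of how the communicators needed by the checkerboard decomposition in 7.5 might be set up (row and column communication domains plus the Cartesian grid); all variable names are my own assumptions.

#include <mpi.h>

/* Sketch: build a 2-D Cartesian communicator, find this process's (row, col)
   coordinates, and split it into one communicator per row and one per column. */
void build_grid_comms(MPI_Comm *grid_comm, MPI_Comm *row_comm,
                      MPI_Comm *col_comm, int coords[2])
{
    int p, grid_id;
    int size[2] = {0, 0}, periodic[2] = {0, 0};

    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Dims_create(p, 2, size);                      /* balanced 2-D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, size, periodic,
                    1 /* allow renumbering */, grid_comm);

    MPI_Comm_rank(*grid_comm, &grid_id);
    MPI_Cart_coords(*grid_comm, grid_id, 2, coords);  /* my (row, col)     */

    /* Same coords[0] -> same row communicator; same coords[1] -> same column. */
    MPI_Comm_split(*grid_comm, coords[0], coords[1], row_comm);
    MPI_Comm_split(*grid_comm, coords[1], coords[0], col_comm);
}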

7.6 Collective Communication

Collective communication is an important concept in parallel computing.

7.6.1 Topological structure of communication network

Before talking about collective communication, we must be clear about what kind of process network the communication takes place in.

What we call the communication network here is entirely about its topology, that is, which processors are connected to which in our parallel programming architecture, regardless of whether they are physically close to each other.

Concepts

A number of common concepts are given below. "Not interested" students do not need to read them, but in that case there is no need to read the analysis in the next section either (you would not be able to follow it anyway).

  1. Direct topology: a switch corresponds to a processor node, and a switch node is connected to a processor node and one or more other switches
  2. Indirect topology: a processor node is connected to multiple switches, some switches can only be connected to other switches
  3. Diameter: The maximum distance between two switch nodes. The diameter determines the lower bound of the complexity of the parallel algorithm for communication between random node pairs, so a smaller diameter is better
  4. Bisection bandwidth: Bisection bandwidth is the minimum number of edges that must be deleted in order to divide the network in half. The larger the bisection bandwidth, the better. In an algorithm that requires a large amount of data movement, the lower bound of the complexity of the parallel algorithm is the data set size divided by the bisection bandwidth
  5. The number of edges of each switch node: It is best that the number of edges of each switch node has nothing to do with the network size, so that it is easier to expand
  6. Fixed side length: For scalability reasons, the best case is that the nodes and edges of the network can be arranged in a three-dimensional space, so that the maximum side length is a constant independent of the network size

Common topology

Many common topologies are given below; "not interested" students only need to understand the ring structure (Ring) and the hypercube (Hypercube).

In the illustrations, a circle represents a switch and a square represents a processor. A switch is a means of controlling processor communication to keep it safe and reliable; the details can be found in the textbook ("not interested" students can skip them).

  • Ring structure

    As the name suggests, the processors are connected one after another in a line, and the two ends are joined to form a closed loop.

  • Two-dimensional mesh network (direct topology)

    [Figure: two-dimensional mesh network]

    Benefit: good scalability (the number of edges per switch is constant)

  • Binary tree network (indirect topology)

    [Figure: binary tree network]

    Benefit: small diameter

    Disadvantages: the bisection bandwidth is small (1)

  • Hypertree network (indirect topology)

    [Figure: hypertree network]

    Benefits: smaller diameter and larger bisection bandwidth than the binary tree

    The 4-ary hypertree network is superior to the binary tree network in almost every respect: it has few switch nodes, a small diameter and a large bisection bandwidth

  • Butterfly network

    [Figure: butterfly network]

  • Hypercube network

    [Figure: hypercube network]

    Diameter: log n; bisection bandwidth: n/2 (for n nodes)

    Routing: adjacent nodes differ in exactly one binary bit, so a path can be generated by flipping, one at a time, the bits in which the source and the destination differ (see the small routing sketch after this list)

  • Shuffle-exchange network

[Figure: shuffle-exchange network]

    This is a compromise solution: a fixed number of edges per switch, a fairly small diameter and reasonably good bisection bandwidth
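Before the summary, here is the small hypercube routing sketch promised above (my own illustration): the route from a source to a destination is obtained by flipping, one bit at a time, the bits in which they differ.

#include <stdio.h>

/* Print one hypercube path from src to dst by correcting the differing bits
   from the least significant bit upward (dim_bits = log2 of the node count). */
void hypercube_route(int src, int dst, int dim_bits)
{
    int node = src;
    printf("%d", node);
    for (int b = 0; b < dim_bits; b++) {
        if (((node ^ dst) >> b) & 1) {   /* bit b still differs from dst */
            node ^= 1 << b;              /* move along that dimension    */
            printf(" -> %d", node);
        }
    }
    printf("\n");
}

/* Example: hypercube_route(0, 5, 3) prints "0 -> 1 -> 5" on a 3-cube. */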

Summary

[Figure: summary table comparing the topologies]

7.6.2 Broadcast & Reduction

B&R operations are mainly divided into the following types

  • one-to-all broadcast,all-to-one reduction
  • all-to-all broadcast,all-to-all reduction
  • gather,scatter
  • all-to-all exchange(Personalized Communication)

Note that whenever we discuss one of these operations, we must state which topology network it is currently running on.

One-to-all broadcast,All-to-one reduction

The basic effects of this communication are as follows

[Figure: effect of one-to-all broadcast and all-to-one reduction]

Take the ring topology as an example (the numbers on the dashed line represent the sequence of occurrence, and the arrows represent the data sent from the starting point to the end point)

[Figures: one-to-all broadcast and all-to-one reduction on a ring]
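For concreteness, a hedged sketch of the most naive ring broadcast (my own illustration, not the scheme in the figure): the message simply travels around the ring one hop at a time, which takes p - 1 steps, whereas the figure's scheme finishes in fewer steps.

#include <mpi.h>

/* Naive one-to-all broadcast on a ring: each process receives the message
   from its left neighbour and forwards it to its right neighbour, stopping
   just before the message would return to the root. Illustrative only;
   in practice MPI_Bcast should be used. */
void ring_broadcast(double *buf, int count, int root, MPI_Comm comm)
{
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    int left  = (id - 1 + p) % p;
    int right = (id + 1) % p;

    if (id != root)
        MPI_Recv(buf, count, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
    if (right != root)
        MPI_Send(buf, count, MPI_DOUBLE, right, 0, comm);
}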

All-to-all broadcast,All-to-all reduction

The effect of this communication is as follows

[Figure: effect of all-to-all broadcast and all-to-all reduction]

Take the ring topology as an example (Broadcast) as follows (the numbers on the dotted line represent the sequence of occurrence, and the arrows represent the data sent from the start point to the end point)

(The following is one of the simplest and most naive methods: using p one-to-all broadcasts easily achieves this effect, although it is not efficient.)

[Figure: all-to-all broadcast on a ring]

An example on a two-dimensional mesh (broadcast) is shown below (the broadcast is only halfway done here; it is completed by another round of communication within the columns).

[Figure: all-to-all broadcast on a two-dimensional mesh, first phase]

Gather,scatter

Scatter has a certain similarity to one-to-all broadcast (the source is unique and data is sent to all processes). In fact, scatter can be realized by modifying a one-to-all broadcast so that each node keeps only the part of the message it needs and removes it from the content being forwarded.

[Figure: gather and scatter]

All-to-all exchange(All-to-all Personalized Communication)

[Figure: all-to-all personalized communication]

Understood from the left-hand side, $M_{i,j}$ denotes the message stored at process $i$ that is to be sent to process $j$ (on the right-hand side the roles are reversed).


Origin blog.csdn.net/Kaiser_syndrom/article/details/105394899