Introduction to Parallel and Distributed Computing: Exercise Guide (1)

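Problem 1: Pseudocode for collective communication operations

Recalling MPI's communication mechanisms, write pseudocode for the following collective communication operations:

One-to-all Broadcast

All-to-all Reduction

Scatter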

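Approach

Pick a communication topology.
Work out how broadcast and reduction are implemented on it.
- For an operation whose dual is already known (for example, deriving one-to-all reduction from a known one-to-all broadcast), it is mostly enough to reverse the order: run the loop backwards (0 to 100 becomes 100 to 0) and swap the send and receive roles. Then check the result by hand on the smallest instance of that topology.
- For an operation with no known dual, just find the simplest way to do what is asked (for example, for all-to-all broadcast on a ring, I simply run p one-to-all broadcasts and call it done).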
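To make the "reverse the order" idea concrete, here is a minimal sketch that simulates a one-to-all sum reduction on a hypercube sequentially, assuming the usual mask-based hypercube broadcast as the starting point. The function name `hypercubeReduceSum` and the use of a plain vector in place of real processors are illustrative assumptions, not part of the exercise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential simulation of a sum reduction towards `root` on a
// d-dimensional hypercube. values[id] is the contribution of processor id.
// Compared with the broadcast, the loop runs from bit 0 up to bit d-1 and
// the send and receive roles are swapped.
long long hypercubeReduceSum(int d, int root, std::vector<long long> values) {
    assert(d >= 0 && values.size() == (std::size_t(1) << d));
    assert(root >= 0 && root < (1 << d));
    int mask = 0;  // bits already folded away; nodes with any of them set are done
    for (int i = 0; i < d; ++i) {
        int bit = 1 << i;
        for (int vid = 0; vid < (1 << d); ++vid) {
            if ((vid & mask) != 0) continue;   // this node already sent its sum
            if ((vid & bit) != 0) {
                int src = vid ^ root;          // real label of the sender
                int dst = (vid ^ bit) ^ root;  // real label of the receiver
                values[dst] += values[src];    // "send" the partial sum
            }
        }
        mask |= bit;
    }
    return values[root];
}
```

For example, `hypercubeReduceSum(3, 5, {0, 1, 2, 3, 4, 5, 6, 7})` accumulates all eight values at processor 5.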

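Reference answer

Everything below uses the hypercube as the example. Implementing these three operations on a hypercube is comparatively hard; with the framework below as a reference, going back to write the pseudocode for a ring should be very easy.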

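One-to-all Broadcast

Three points need attention:
First, on a hypercube, communication happens only between processes whose binary labels differ in exactly one bit (e.g. 11000 and 11010);
Second, a processor must have received the message before it sends it on (the source excepted);
Third, an efficient implementation must avoid duplicate sends (that is, every processor receives from exactly one source). Given that, to avoid deadlock, a processor A may send only to processors whose unique source is A (otherwise A keeps waiting in send for a receive that is never posted, which under some communication protocols is a deadlock).

A hypercube-based one-to-all broadcast is given below.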

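void One_to_all_Broadcast(
    int d,                 // dimension of the hypercube
    int myId,              // label of this processor
    int sourceId,          // label of the message source
    MessageType& message   // the message
)
{
    int virtualId = myId ^ sourceId;   // label of this processor relative to the source
    int mask = (1 << d) - 1;           // all d low bits set; controls who may send or receive
    for (int i = d - 1; i >= 0; i--)
    {
        int temp = 1 << i;
        mask = mask ^ temp;            // clear bit i of mask
        if ((virtualId & mask) == 0)   // may this processor take part? let s be the value of i when this first holds
        {
            if ((virtualId & temp) == 0)   // sender: a non-source processor only sends across bits below s
            {
                int virtualDes = virtualId ^ temp;   // relative label that differs in bit i
                int des = virtualDes ^ sourceId;     // convert the relative label to an absolute one
                send(message, des);                  // send the message there
            }
            else   // receiver: a processor only receives from the one that differs from it in bit s
            {
                int virtualSource = virtualId ^ temp;   // relative label that differs in bit i
                int sour = virtualSource ^ sourceId;    // convert the relative label to an absolute one
                receive(message, sour);                 // receive the message from there
            }
        }
    }
}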
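The three points above can be checked mechanically. The following is a rough sketch that lists the sends produced by the same mask logic and verifies that every sender already holds the message and that every non-source processor receives exactly once. The names `Send`, `broadcastSchedule` and `scheduleIsValid` are made up for this illustration.

```cpp
#include <cassert>
#include <vector>

struct Send { int step; int from; int to; };

// Lists the sends of a one-to-all broadcast on a d-dimensional hypercube,
// using the same mask logic as the pseudocode (step 0 handles bit d-1).
std::vector<Send> broadcastSchedule(int d, int source) {
    assert(d >= 0 && source >= 0 && source < (1 << d));
    std::vector<Send> sends;
    int mask = (1 << d) - 1;
    for (int i = d - 1; i >= 0; --i) {
        int bit = 1 << i;
        mask ^= bit;  // clear bit i
        for (int vid = 0; vid < (1 << d); ++vid) {
            if ((vid & mask) == 0 && (vid & bit) == 0) {
                sends.push_back({d - 1 - i, vid ^ source, (vid ^ bit) ^ source});
            }
        }
    }
    return sends;
}

// True if every sender already holds the message when it sends and every
// processor other than the source receives exactly once.
bool scheduleIsValid(int d, int source) {
    std::vector<int> receives(1 << d, 0);
    std::vector<bool> has(1 << d, false);
    has[source] = true;
    for (const Send& s : broadcastSchedule(d, source)) {
        if (!has[s.from]) return false;
        has[s.to] = true;
        ++receives[s.to];
    }
    for (int id = 0; id < (1 << d); ++id) {
        int expected = (id == source) ? 0 : 1;
        if (receives[id] != expected) return false;
    }
    return true;
}
```

With d = 3 the schedule contains 1 + 2 + 4 = 7 sends, one per non-source processor.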

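All-to-all Reduction

Hypercube-based all-to-all reduction (using sum as the example)

The all-to-all operations are comparatively simple, because every processor interacts with every other processor.
Still, the order of the interactions matters. On a d-dimensional hypercube, an optimal all-to-all reduction or broadcast finishes in d steps. The idea is that each step picks one equivalence class of parallel edges, and the two endpoints of every edge in that class exchange information.
That sounds abstract, so here is an example. Take a three-dimensional hypercube, pick one vertex as the origin and the three edges leaving it as the x, y and z axes. In step 1, select every edge parallel to the x axis; the two processors at the ends of each selected edge exchange information over it. Step 2 does the same for the edges parallel to the y axis, and step 3 for those parallel to the z axis.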

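void All_to_all_reduction(
    int d,           // dimension of the hypercube
    int myId,        // label of this processor
    int myMessage,   // this processor's own value
    int& result      // on return, the sum over all processors
)
{
    int receiveMessage;   // value received from the current partner
    result = myMessage;
    for (int i = d - 1; i >= 0; i--)
    {
        int partner = myId ^ (1 << i);      // neighbour across dimension i
        send(result, partner);              // send the running partial sum, not just myMessage
        receive(receiveMessage, partner);   // assumes send does not block until the matching receive is posted
        result += receiveMessage;
    }
}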
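The dimension-by-dimension exchange described above can be tried out without MPI. Below is a small sequential sketch in which a vector stands in for the processors; it assumes that in each step both endpoints of an edge swap their running sums. The name `hypercubeAllReduceSum` is an illustrative choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the exchange along one dimension per step: every processor
// swaps its running sum with the neighbour across dimension i and adds it.
// After d steps every entry holds the sum of all the initial values.
std::vector<long long> hypercubeAllReduceSum(int d, std::vector<long long> values) {
    assert(d >= 0 && values.size() == (std::size_t(1) << d));
    for (int i = d - 1; i >= 0; --i) {
        std::vector<long long> next(values.size());
        for (std::size_t id = 0; id < values.size(); ++id) {
            std::size_t partner = id ^ (std::size_t(1) << i);
            next[id] = values[id] + values[partner];  // both ends use the old values
        }
        values = next;
    }
    return values;
}
```

The point of the sketch is that the value exchanged in each step has to be the running sum. If every step sent only the processor's original value, the result would just be the sum over its d direct neighbours and itself.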

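Scatter

Hypercube-based Scatter

This builds on the one-to-all broadcast above and adds a step that extracts the piece this process needs from message, saves it, and removes it from message (the lazy way XD).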

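void Scatter(
    int d,                 // dimension of the hypercube
    int myId,              // label of this processor
    int sourceId,          // label of the message source
    MessageType& message   // in: all pieces (at the source); out: this processor's own piece
)
{
    MessageType result;
    if (myId == sourceId) result = getAndDelete(myId, message);   // the source extracts its own piece and broadcasts the rest
    int virtualId = myId ^ sourceId;   // label of this processor relative to the source
    int mask = (1 << d) - 1;           // all d low bits set; controls who may send or receive
    for (int i = d - 1; i >= 0; i--)
    {
        int temp = 1 << i;
        mask = mask ^ temp;            // clear bit i of mask
        if ((virtualId & mask) == 0)   // may this processor take part? let s be the value of i when this first holds
        {
            if ((virtualId & temp) == 0)   // sender: a non-source processor only sends across bits below s
            {
                int virtualDes = virtualId ^ temp;   // relative label that differs in bit i
                int des = virtualDes ^ sourceId;     // convert the relative label to an absolute one
                send(message, des);                  // send the remaining pieces there
            }
            else   // receiver: a processor only receives from the one that differs from it in bit s
            {
                int virtualSource = virtualId ^ temp;   // relative label that differs in bit i
                int sour = virtualSource ^ sourceId;    // convert the relative label to an absolute one
                receive(message, sour);                 // receive the pieces from there
                result = getAndDelete(myId, message);   // extract this processor's own piece
            }
        }
    }
    message = result;   // keep only the piece this processor was meant to receive
}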
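As a cross-check, here is a sequential sketch of a scatter on the hypercube. Note that it is a different variant from the pseudocode above: instead of forwarding the whole remaining message, each sender forwards only the pieces destined for the subcube on the other side of the edge. The name `hypercubeScatter` and the use of `int` pieces stored in a `std::map` are assumptions made for the illustration.

```cpp
#include <cassert>
#include <map>
#include <vector>

// held[id] maps destination label -> piece currently stored at processor id.
// pieces[id] is the piece that must end up at processor id.
std::vector<std::map<int, int>> hypercubeScatter(int d, int source,
                                                 const std::vector<int>& pieces) {
    int p = 1 << d;
    assert(source >= 0 && source < p && (int)pieces.size() == p);
    std::vector<std::map<int, int>> held(p);
    for (int id = 0; id < p; ++id) held[source][id] = pieces[id];
    int mask = p - 1;
    for (int i = d - 1; i >= 0; --i) {
        int bit = 1 << i;
        mask ^= bit;  // clear bit i
        for (int vid = 0; vid < p; ++vid) {
            if ((vid & mask) != 0 || (vid & bit) != 0) continue;  // not a sender in this step
            int from = vid ^ source;
            int to = (vid ^ bit) ^ source;
            // Forward only the pieces whose relative destination has bit i set.
            for (auto it = held[from].begin(); it != held[from].end();) {
                int destVid = it->first ^ source;
                if (destVid & bit) {
                    held[to][it->first] = it->second;
                    it = held[from].erase(it);
                } else {
                    ++it;
                }
            }
        }
    }
    return held;
}
```

After the d steps every processor holds exactly one piece, its own. The amount of data sent halves at every step, which is what the lazy version gives up.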

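Problem 2: Isoefficiency relation and scalability function of Floyd's algorithm

For Floyd's algorithm there is an MPI implementation in which each processor spends $\Theta(n^2 \log p)$ on communication, and a problem of size n needs $n^2$ memory, i.e. $M(n)=n^2$. Find the isoefficiency relation and the scalability function of this system.

Approach

Problems of this kind are solved in three steps:

Determine the serial time complexity $T(n,1)$, the total extra time $T_0(n,p)$ that all processes spend beyond the original serial time, and the relation $M(n)$ between problem size and memory.
Compute the isoefficiency relation $T(n,1)\ge CT_0(n,p) \Longrightarrow n\ge f(p)$
Compute the order of the scalability function $M(f(p))/p$

Note that, by its definition, $T_0(n,p)$ should contain both the communication time and the serial time (the serial part is executed an extra p-1 times). In the isoefficiency relation, however, the extra time is treated as a function of p (we take p as the variable and ask when the inequality holds as p changes, with n held constant in each analysis). So in the isoefficiency calculation we actually use $T_0(n,p)=\Theta(p\,\kappa(n,p))$, where $\kappa(n,p)$ is the communication time per processor; that is, only the communication time of the p processors is counted.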

Reference answer

From the problem statement,
$T(n,1)=\Theta(n^3)\\T_0(n,p)=\Theta(n^2 p\log p)\\M(n)=n^2$
so the isoefficiency relation is
$n^3\ge Cn^2p\log p\Longrightarrow n\ge Cp\log p$
and the scalability function is
$M(Cp\log p)/p=\frac{C^2p^2\log^2p}{p}=C^2p\log^2p=\Theta(p\log^2p)$
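To get a feel for what this result means, the two formulas can be evaluated for a few processor counts. This is only a numeric sketch: the base-2 logarithm and the constant C passed in by the caller are assumptions, and the function names are invented here.

```cpp
#include <cassert>
#include <cmath>

// Smallest problem size allowed by the isoefficiency relation
// n >= C * p * log2(p). The log base is an assumption; it only changes C.
double minProblemSize(double p, double C) {
    assert(p >= 1.0 && C > 0.0);
    return C * p * std::log2(p);
}

// Scalability function M(f(p)) / p with M(n) = n^2, i.e. the memory each
// processor needs when the problem grows just fast enough to keep efficiency.
double memoryPerProcessor(double p, double C) {
    double n = minProblemSize(p, C);
    return n * n / p;
}
```

With C = 1, going from p = 4 to p = 16 raises the memory needed per processor from 16 to 256 units. Because $p\log^2p$ grows with p, the memory per processor has to keep growing to hold efficiency constant.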

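Problem 3: Finding primes

Complete the prime-finding exercise (#4) and give a performance comparison and analysis for different problem sizes and processor counts.

Write an MPI parallel program that finds all primes less than the integer scope.
Input: two positive integers p and k; p is the number of processors the parallel program uses, and k means scope=2^k
Output: store all primes found, in ascending order, in a binary file named "ref.out". Each prime is represented as a 64-bit integer.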
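The output format asked for above can be sketched as follows. This assumes the values are written in the machine's native byte order with no header, which the problem statement does not spell out; `writePrimes` and `readPrimes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Writes the primes as raw 64-bit integers, in the order given.
// Byte order is the machine's native one (an assumption).
bool writePrimes(const char* path, const std::vector<std::int64_t>& primes) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t written = 0;
    if (!primes.empty())
        written = std::fwrite(primes.data(), sizeof(std::int64_t), primes.size(), f);
    bool closed = (std::fclose(f) == 0);
    return closed && written == primes.size();
}

// Reads the file back, one 64-bit integer at a time, to check it.
std::vector<std::int64_t> readPrimes(const char* path) {
    std::vector<std::int64_t> out;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return out;
    std::int64_t v;
    while (std::fread(&v, sizeof v, 1, f) == 1) out.push_back(v);
    std::fclose(f);
    return out;
}
```

In an MPI program only the root process would call something like `writePrimes("ref.out", primes)` after the results have been gathered.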

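The textbook's sample code (counting primes)

The textbook gives a standard version of the code, and we can solve the prime problem on top of it.
The textbook code counts the primes between 0 and scope and prints the count, using the sieve of Eratosthenes.
The code is fairly long but easy enough to read. Read the textbook code before the improved version; otherwise the core part is hard to follow.

Improved version of the textbook code

Shown below is an improved version of the textbook code (all even numbers are removed up front), which we will build on to solve the prime problem. This code computes the number of primes between 0 and scope fairly efficiently.

My approach: delete this code's output, use gather to merge the processes' marking arrays, and have the root process walk the merged array and print every unmarked number. Note in particular that this code shrinks the scope-sized array to an odd-only array of size scope/2, so take care when mapping indices back to values on output.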

#include"mpi.h"
#include<math.h>
#include<stdio.h>
#include<stdlib.h>
#define MIN(a,b)  ((a)<(b)?(a):(b)) // smaller of two values
// Debug command (cmd): mpiexec -n processnum C:\Users\socrali\source\repos\MPIProject\x64\Debug\MPIProject.exe scope
int main(int argc, char* argv[])
{
    int    count;        /* Local prime count */
    double elapsed_time; /* Parallel execution time */
    int    first;        /* Index of first multiple */
    int    global_count; /* Global prime count */
    int    high_value;   /* Highest value on this proc */
    int    i;
    int    id;           /* Process ID number */
    int    index;        /* Index of current prime */
    int    low_value;    /* Lowest value on this proc */
    char* marked;       /* Portion of 2,...,'n' */
    int    n, m;            /* Sieving from 2, ..., 'n' */
    int    p;            /* Number of processes */
    int    proc0_size;   /* Size of proc 0's subarray */
    int    prime;        /* Current prime */
    int    size;         /* Elements in 'marked' */

    MPI_Init(&argc, &argv);

    /* Start the timer */

    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Barrier(MPI_COMM_WORLD);
    elapsed_time = -MPI_Wtime();

    if (argc != 2) {
        if (!id) printf("Command line: %s <m>\n", argv[0]);
        MPI_Finalize();
        exit(1);
    }

    n = atoi(argv[1]);
    m = n; // keep the original upper bound for the final report
    n = (n % 2 == 0) ? (n / 2 - 1) : ((n - 1) / 2); // convert the input n into the size of the odd-only array (the odd number 1 is excluded)
    //if (!id) printf ("Number of odd integers:%d    Maximum value of odd integers:%d\n",n+1,3+2*(n-1));
    if (n == 0) {
        // for input 2, report 1 prime and stop
        if (!id) printf("There is 1 prime less than or equal to %d\n", m);
        MPI_Finalize();
        exit(1);
    }
    /* Figure out this process's share of the array, as
       well as the integers represented by the first and
       last array elements */

    low_value = 3 + 2 * (id * (n) / p); // first number handled by this process
    high_value = 3 + 2 * ((id + 1) * (n) / p - 1); // last number handled by this process
    size = (high_value - low_value) / 2 + 1;    // size of this process's array


    /* Bail out if all the primes used for sieving are
       not all held by process 0 */

    proc0_size = (n - 1) / p;

    if ((3 + 2 * (proc0_size - 1)) < (int)sqrt((double)(3 + 2 * (n - 1)))) {
        if (!id) printf("Too many processes\n");
        MPI_Finalize();
        exit(1);
    }

    /* Allocate this process's share of the array. */

    marked = (char*)malloc(size);

    if (marked == NULL) {
        printf("Cannot allocate enough memory\n");
        MPI_Finalize();
        exit(1);
    }

    for (i = 0; i < size; i++) marked[i] = 0;
    if (!id) index = 0;
    prime = 3; // start sieving from the prime 3
    do {
        // find the index of the first odd multiple of prime in this block
        if (prime * prime > low_value)
            first = (prime * prime - low_value) / 2;
        else {
            if (!(low_value % prime))
                first = 0;
            else
                first = ((prime - low_value % prime) % 2 == 0) ? ((prime - low_value % prime) / 2) : ((prime - low_value % prime + prime) / 2);
        }

        for (i = first; i < size; i += prime)  marked[i] = 1;
        if (!id) {
            while (marked[++index]);
            prime = 2 * index + 3; // next unmarked number, i.e. the next prime
        }
        if (p > 1) MPI_Bcast(&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } while (prime * prime <= 3 + 2 * (n - 1)); // stop once prime^2 exceeds the largest odd value

    count = 0;
    for (i = 0; i < size; i++)
        if (!marked[i])  count++;
    /* Reduce unconditionally so that global_count is also set when p == 1 */
    MPI_Reduce(&count, &global_count, 1, MPI_INT, MPI_SUM,
        0, MPI_COMM_WORLD);

    /* Stop the timer */

    elapsed_time += MPI_Wtime();


    /* Print the results */

    if (!id) {
        printf("There are %d primes less than or equal to %d\n",
            global_count + 1, m); // sieving started at 3 and skipped the prime 2, so add 1 to the count
        printf("SIEVE (%d) %10.6f\n", p, elapsed_time);
    }
    MPI_Finalize();
    return 0;
}
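The odd-only indexing is the part of the code above that is easiest to get wrong, so here is a sequential sketch of just that idea, with no MPI involved. Index i stands for the odd number 2*i + 3, as in the program above; the function name `oddOnlySieve` and the choice to return the primes up to and including `limit` are assumptions for the illustration.

```cpp
#include <cassert>
#include <vector>

// Returns all primes <= limit using an odd-only sieve:
// index i of `marked` stands for the odd number 2*i + 3.
std::vector<int> oddOnlySieve(int limit) {
    assert(limit >= 0);
    std::vector<int> primes;
    if (limit < 2) return primes;
    primes.push_back(2);             // the only even prime is not stored in the array
    int n = (limit - 1) / 2;         // how many odd numbers lie in 3..limit
    std::vector<char> marked(n, 0);
    for (int i = 0; i < n; ++i) {
        if (marked[i]) continue;
        long long prime = 2LL * i + 3;
        primes.push_back((int)prime);
        // Start at prime*prime. Stepping the value by 2*prime skips the even
        // multiples, and in the odd-only array that is an index step of prime.
        for (long long j = (prime * prime - 3) / 2; j < n; j += prime) marked[j] = 1;
    }
    return primes;
}
```

For example, `oddOnlySieve(30)` yields 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. Mapping an unmarked index back to its value with 2*i + 3 is the step that needs care when printing the merged array.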

The author's code

Folks, my code is for reference and study only; I have to hand in this assignment too.

#include"mpi.h"
#include<math.h>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#define MIN(a,b)  ((a)<(b)?(a):(b))
using namespace std;
// Debug command (cmd): mpiexec -n 10 C:\Users\56875\source\repos\MPIProject\x64\Debug\MPIProject.exe 20
// Test sizes do not exceed 2^30, so int64 is not used
int main(int argc, char* argv[])
{
    int    first;        /* Index of first multiple */
    double elapsed_time; /* Parallel execution time */
    int    high_value;   /* Highest value on this proc */
    int    i;
    int    id;           /* Process ID number */
    int    index;        /* Index of current prime */
    int    low_value;    /* Lowest value on this proc */
    bool*  marked;       /* Portion of 2,...,'n' */
    int    n, m;            /* Sieving from 2, ..., 'n' */
    int    p;            /* Number of processes */
    int    proc0_size;   /* Size of proc 0's subarray */
    int   prime;        /* Current prime */
    int   size;         /* Elements in 'marked' */
    bool* totalmark;
    int * sizeInProc;
    int * beginLoc;

    MPI_Init(&argc, &argv);
    /* Start the timer */

    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Barrier(MPI_COMM_WORLD);

    elapsed_time = -MPI_Wtime();

    if (argc != 2) {
        if (!id) printf("Command line: %s <m>\n", argv[0]);
        MPI_Finalize();
        exit(1);
    }

    n = atoi(argv[1]);
    n = (1 << n); // scope = 2^k
    m = n; // keep the original upper bound
    n = (n % 2 == 0) ? (n / 2 - 1) : ((n - 1) / 2); // convert n into the size of the odd-only array (the odd number 1 is excluded)
    if (n == 0) {
        // when scope is 2, print 2 and stop
        if (!id) printf("2\n");
        MPI_Finalize();
        exit(1);
    }

    low_value = 3 + 2 * (id * (n) / p); // first number handled by this process
    high_value = 3 + 2 * ((id + 1) * (n) / p - 1); // last number handled by this process
    size = (high_value - low_value) / 2 + 1;    // size of this process's array

    // prepare for merging the results: every process learns all block sizes and offsets
    sizeInProc = new int[p];
    beginLoc = new int[p];
    MPI_Allgather(&size, 1, MPI_INT, sizeInProc, 1, MPI_INT, MPI_COMM_WORLD);
    beginLoc[0] = 0;
    for (int i = 1; i < p; i++)
    {
        beginLoc[i] = beginLoc[i - 1] + sizeInProc[i - 1];
    }

    proc0_size = (n - 1) / p;

    // too many processes
    if ((3 + 2 * (proc0_size - 1)) < (int)sqrt((double)(3 + 2 * (n - 1))))
    {
        if (!id) printf("Too many processes\n");
        MPI_Finalize();
        exit(1);
    }

    // memory allocation failed
    marked = (bool*)malloc(size);
    if (marked == NULL) {
        printf("Cannot allocate enough memory\n");
        MPI_Finalize();
        exit(1);
    }

    for (i = 0; i < size; i++) marked[i] = 0;
    if (!id) index = 0;
    prime = 3; // start sieving from the prime 3
    do {
        // find the index of the first odd multiple of prime in this block
        if (prime * prime > low_value)
            first = (prime * prime - low_value) / 2;
        else {
            if (!(low_value % prime))
                first = 0;
            else
                first = ((prime - low_value % prime) % 2 == 0) ? ((prime - low_value % prime) / 2) : ((prime - low_value % prime + prime) / 2);
        }

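        for (i = first; i < size; i += prime)  marked[i] = 1; // adding prime once gives an even number, which is not stored, so the value really advances by prime*2; in the odd-only array that is i += prime
        if (!id) {
            while (marked[++index]);
            prime = 2 * index + 3; // next unmarked number, i.e. the next prime
        }
        if (p > 1) MPI_Bcast(&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } while (prime * prime <= 3 + 2 * (n - 1)); // stop once prime^2 exceeds the largest odd value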

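    // merge the results
    // Gather unconditionally so that totalmark is also filled when p == 1.
    // In principle only process 0 needs totalmark, but when I declared it only there I got an error, and the forced build would not run.
    totalmark = new bool[n] {0};
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Gatherv(marked, size, MPI_C_BOOL, totalmark, sizeInProc, beginLoc, MPI_C_BOOL, 0, MPI_COMM_WORLD);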

    elapsed_time += MPI_Wtime();

    // print the results
    if (!id) {
        printf("Processor Number: (%d)\nTime Cost: %10.6f s\n", p, elapsed_time);
        printf("No.1 2\n");
        int count = 2;
        for (int i = 0; i < n; i++)
        {
            if (!totalmark[i])
            {
                printf("No.%d %d\n", count, 2 * i + 3);
                count++;
            }
        }
    }
    MPI_Finalize();
    return 0;
}

Figure: output of the code (primes below 10000, 10 processes).
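Two formulas in the program above carry most of the logic: the block each process owns, and the index of the first multiple to mark inside that block. The sketch below pulls both out into plain functions so they can be checked by hand. The names `Block`, `blockOf` and `firstIndex` are invented for this illustration, and the arithmetic mirrors the expressions used in the program.

```cpp
#include <cassert>

struct Block { int low; int high; int size; };

// Block of odd numbers owned by process id, where n is the number of odd
// values 3, 5, ..., 2n+1 and p is the process count.
Block blockOf(int id, int p, int n) {
    assert(p > 0 && id >= 0 && id < p && n >= p);
    long long lowIdx = (long long)id * n / p;
    long long nextIdx = (long long)(id + 1) * n / p;
    Block b;
    b.low = (int)(3 + 2 * lowIdx);          // first odd value in the block
    b.high = (int)(3 + 2 * (nextIdx - 1));  // last odd value in the block
    b.size = (b.high - b.low) / 2 + 1;
    return b;
}

// Index, inside a block that starts at the odd value low, of the first odd
// multiple of prime that has to be marked.
int firstIndex(int prime, int low) {
    assert(prime >= 3 && prime % 2 == 1 && low >= 3 && low % 2 == 1);
    long long sq = (long long)prime * prime;
    if (sq > low) return (int)((sq - low) / 2);  // start at prime*prime
    int r = low % prime;
    if (r == 0) return 0;                        // low itself is a multiple
    int gap = prime - r;                         // distance to the next multiple
    // An odd gap leads to an even multiple, so move on to the one after it.
    return (gap % 2 == 0) ? gap / 2 : (gap + prime) / 2;
}
```

For instance, with n = 10 and p = 3 the blocks start at 3, 9 and 15 and have sizes 3, 3 and 4. These sizes are what `MPI_Gatherv` needs as receive counts, and their running sums are the displacements.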

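Problem 4: Big and Small

Rant: with OpenMP this would be a single for loop, but we are made to use MPI.

Complete the "Big" and "Small" exercise (#52) and give a performance comparison and analysis for different problem sizes and process counts.

This problem is simple, and the same advice applies: don't copy it verbatim.

Without further ado, here is the source: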

#include"mpi.h"
#include<math.h>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include"fio.h"
using namespace std;
// Debug command (cmd): mpiexec -n 10 C:\Users\socrali\source\repos\MPIProject\x64\Debug\MPIProject.exe 10
//struct FILE_SEG {
    //int64_t offset;
    //int64_t width;
    //int64_t stride;
//};
inline bool mymax(int64_t a,int64_t b)
{
    return a > b; // true if a is the larger of the two
}
//int64_t input_data(void* buf, int64_t count, FILE_SEG fseg);
//int64_t output_data(void* buf, int64_t count, FILE_SEG fseg);
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int id, p;
    int n;
    int64_t* A;
    int64_t* B;
    FILE_SEG read,out;
    read.offset = 0; read.stride = 8;read.width = 8; // too lazy: every process just reads the whole input, which scales poorly
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Barrier(MPI_COMM_WORLD);
    n = atoi(argv[1]);
    n = (1 << n);
    A = new int64_t [n + 2];
    B = new int64_t [n + 2];
    input_data(A, 8*n+16, read);
    
    // work distribution
    // only the first half needs to be distributed
    if(!id)printf("Mission Size: %d\n",n);
    int64_t leftover = (n/2) % p; // number of leftover tasks
    int64_t base = (n/2) / p; // base number of tasks per process
    int64_t size[15] = { 0 }; // number of tasks per process (assumes at most 15 processes)
    int64_t offset[15] = { 0 }; // starting index for each process
    for (int i = 0; i < p; i++)
    {
        if (i < leftover)size[i] = base + 1;
        else size[i] = base;
        if (i)offset[i] = size[i - 1] + offset[i - 1];
    }
    printf("%d process size: %lld\n", id, (long long)size[id]);


    // do the work
    int64_t workbound = offset[id] + size[id];
    for (int64_t i = offset[id]; i < workbound; i++)
    {
        if (mymax(A[i], A[n - 1 - i])) // the former is larger
        {
            B[i] = A[n - 1 - i];
            B[n - 1 - i] = A[i];
        }
        else // the latter is larger
        {
            B[i] = A[i];
            B[n - 1 - i] = A[n - 1 - i];
        }
    }

    // merge the results & output
    out.offset = 8*offset[id];
    out.stride = 8;
    out.width = 8;
    output_data(B, 8*size[id],out);
    MPI_Finalize();
    return 0;
}

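Note that #include<stdio.h> must appear before #include"fio.h", because fio.h uses int64_t without defining it. Strictly speaking, int64_t is declared in <stdint.h> (<cstdint> in C++) as a typedef for a 64-bit integer type such as long long int; here <stdio.h> evidently brings that definition in, and including <stdint.h> explicitly is the more reliable choice.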
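The work split and the pairwise comparison used in the program above can also be tried without MPI or fio.h. The sketch below is a sequential stand-in that visits the mirrored pairs worker by worker, the way p processes would. `workRange` and `bigAndSmall` are hypothetical names, and the rule that the smaller value goes to the front and the larger to the back is read off the program above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Splits `total` items over p workers; the first (total % p) workers get one
// extra item. Returns {offset, size} for worker id.
std::pair<std::int64_t, std::int64_t> workRange(std::int64_t total, int p, int id) {
    assert(p > 0 && id >= 0 && id < p && total >= 0);
    std::int64_t base = total / p;
    std::int64_t leftover = total % p;
    std::int64_t size = base + (id < leftover ? 1 : 0);
    std::int64_t offset = id * base + (id < leftover ? id : leftover);
    return {offset, size};
}

// For every mirrored pair (i, n-1-i) puts the smaller value at i and the
// larger at n-1-i, handling the pairs in p contiguous chunks.
std::vector<std::int64_t> bigAndSmall(const std::vector<std::int64_t>& a, int p) {
    std::int64_t n = (std::int64_t)a.size();
    std::vector<std::int64_t> b(a);  // copy, so the middle element of an odd-length input is kept
    for (int id = 0; id < p; ++id) {
        std::pair<std::int64_t, std::int64_t> range = workRange(n / 2, p, id);
        for (std::int64_t i = range.first; i < range.first + range.second; ++i) {
            b[i] = std::min(a[i], a[n - 1 - i]);
            b[n - 1 - i] = std::max(a[i], a[n - 1 - i]);
        }
    }
    return b;
}
```

Computing the offset in closed form, as `workRange` does, avoids the fixed-size `size[15]` and `offset[15]` arrays, so the same code works for any process count.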

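Postscript

When I searched online this year, neither of these two problems had a reference answer anywhere, so I am posting mine here in the hope that people will waste less time on them in the future.
A salted fish's life is short, so strive to get even saltier!!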