Matrix multiplication on the stack vs. on the heap

1. Multiplying arrays on the stack

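Define three 500 x 500 arrays of double as local variables. Together they occupy 3 x 500 x 500 x 8 = 6 x 10^6 bytes, roughly 6 MB, which fits in a stack of just under 8 MB. Then multiply the matrices and time the run.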

2. Allocating heap memory with new through pointers on the stack, then multiplying

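Define double **a, **b, **c and use new to create three 500 x 500 arrays. Only the three pointers a, b, c live on the stack; each points to a heap table of 500 row pointers (about 3 x 500 pointers in total), and those point to 3 x 500 x 500 x 8 = 6 x 10^6 bytes (about 6 MB) of heap memory. Then multiply the matrices and time the run.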

The code:

#include<iostream>
using namespace std;

#include<cmath>
#include<time.h>

const int n = 500;  // compile-time constant, so a[n][n] below is a standard fixed-size array

// Multiply two n x n matrices held in local (stack) arrays; needs about 6 MB of stack.
double stack_multiply(void){

    double a[n][n], b[n][n], c[n][n];
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            a[i][j] = i*j;
            b[i][j] = i*j;
        }
    }
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            double y=0;
            for(int k=0;k<n;k++){
                y += a[i][k] * b[k][j];
            }
            c[i][j] = y;
        }
    }
    return c[200][200];
}

// Compute c = a * b for n x n matrices stored as heap-allocated rows.
void heap_multiply(double **a, double **b, double **c){

    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            double y=0;
            for(int k=0;k<n;k++){
                y += a[i][k] * b[k][j];
            }
            c[i][j] = y;
        }
    }
}

int main(){

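    // Time the stack version (this also includes filling a and b inside stack_multiply).
    clock_t t_start = clock();
    int repeat = 1E0;
    double y = 0;
    for(int i=0;i<repeat;i++)
        y = stack_multiply();
    clock_t t_end = clock();   // stop the clock before printing so I/O is not timed
    cout<<"\t\tc[200][200]="<<y<<endl;
    cout<<" It took me "<< (double)(t_end- t_start)/repeat/CLOCKS_PER_SEC<<"s to do the matrix multiplication in stack."<<endl;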

    double **a=new double *[n];
    for(int i=0;i<n;i++){
        a[i] = new double [n];
        for(int j=0;j<n;j++){
            a[i][j] = i*j;
        }
    }

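    // Fragmentation experiment: uncomment the block below and the matching delete block.
    // It allocates piece^3 doubles, so set piece to 1E1 or 1E2 (the values used in the
    // runs reported below); 1E4 would request about 8 TB.
    int piece = 1E4;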
    /*
    double ***fragment = new double ** [piece];
    for(int i=0;i<piece;i++){
        fragment[i] = new double * [piece];
        for(int j=0;j<piece;j++){
            fragment[i][j] = new double [piece];
        }
    }
    */
    double **b=new double *[n];
    for(int i=0;i<n;i++){
        b[i] = new double [n];
        for(int j=0;j<n;j++){
            b[i][j] = i*j;
        }
    }

    double **c = new double * [n];
    for(int i=0;i<n;i++)
        c[i] = new double [n];
        

    t_start = clock();
    heap_multiply(a, b, c);
    t_end = clock();

    cout<<"\t\tc[200][200]="<<c[200][200]<<endl;

    for(int i=0;i<n;i++)
        delete [] a[i];
    delete [] a;
    
    /*
    for(int i=0;i<piece;i++){
        for(int j=0;j<piece;j++){
            delete []fragment[i][j];
        }
        delete [] fragment[i];
    }
    delete [] fragment;
    */
    
    for(int i=0;i<n;i++)
        delete [] b[i];
    delete [] b;
    for(int i=0;i<n;i++)
        delete [] c[i];
    delete [] c;

    cout<<" It took me "<< (double)(t_end- t_start)/CLOCKS_PER_SEC<<"s to do the matrix multiplication in heap."<<endl;

    return 0;
}

The fragment block, allocated between a and b, is there to simulate heap fragmentation.

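With the fragment block commented out (no fragmentation on the heap, so matrices a and b sit next to each other), the two versions take about the same time, and the heap version is even slightly faster: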

g++ main.cpp

./a.out

    c[200][200]=1.66167e+12

 It took me 0.71875s to do the matrix multiplication in stack

    c[200][200]=1.66167e+12

 It took me 0.671875s to do the matrix multiplication in heap

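Note that compiling with -O2 cuts the run time to a bit more than one third of the unoptimized time (about 46% for the stack version and 37% for the heap version):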

g++ main.cpp -O2

./a.out

    c[200][200]=1.66167e+12

 It took me 0.328125s to do the matrix multiplication in stack

    c[200][200]=1.66167e+12

 It took me 0.25s to do the matrix multiplication in heap

Enabling the fragment block to simulate heap fragmentation, with piece = 1E1, gives:

g++ main.cpp -O2

./a.out

    c[200][200]=1.66167e+12

 It took me 0.296875s to do the matrix multiplication in stack

    c[200][200]=1.66167e+12

 It took me 0.234375s to do the matrix multiplication in heap

Setting piece = 1E2 gives much the same result:

g++ main.cpp -O2

./a.out

    c[200][200]=1.66167e+12

 It took me 0.28125s to do the matrix multiplication in stack

    c[200][200]=1.66167e+12

 It took me 0.203125s to do the matrix multiplication in heap

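So the two approaches seem to perform about the same. A likely reason is that a memory read, whether from the heap or the stack, brings in a whole block of neighbouring values (a cache line) at once, so as long as the accesses do not jump around between scattered locations, the program stays fast.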

During testing, however, I once saw the second (heap) version become very slow with piece = 1E2, so the conclusion is not entirely certain.

Reposted from www.cnblogs.com/luyi07/p/10503283.html