Matrix multiplication on the stack vs. on the heap

1. Multiplying arrays on the stack

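We can define three 500 x 500 arrays of double as local variables. Together they occupy 3 x 500 x 500 x 8 = 6 x 10^6 bytes, roughly 6 MB, while the stack space is a little under 8 MB, so they just fit. We then multiply the matrices and time the run.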
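The size estimate can be checked with a few lines of C++. This is only a minimal sketch: the names matrices_bytes, assumed_stack_limit and fits_on_stack are hypothetical, and the 8 MiB limit is an assumption for illustration (the real limit depends on the system and can usually be queried with `ulimit -s`).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for `count` square n x n matrices of double.
std::size_t matrices_bytes(std::size_t n, std::size_t count) {
    assert(n > 0 && count > 0);  // sanity check on the inputs
    return count * n * n * sizeof(double);
}

// Assumed stack limit, for illustration only: 8 MiB.
const std::size_t assumed_stack_limit = 8u * 1024u * 1024u;

// True if the matrices alone stay under the assumed limit.
// Other stack usage (call frames, other locals) is ignored here.
bool fits_on_stack(std::size_t n, std::size_t count) {
    return matrices_bytes(n, count) < assumed_stack_limit;
}
```

With 8-byte doubles, three 500 x 500 matrices come to 6,000,000 bytes, which is under the assumed limit, while three 600 x 600 matrices would already exceed it.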

2. Multiplying matrices in heap memory allocated with new through pointers on the stack

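Declare double **a, **b, **c and use new to allocate three 500 x 500 arrays. The three pointers a, b and c themselves are on the stack. Each points to a heap array of 500 row pointers (3 x 500 in total), and those row pointers point to 3 x 500 x 500 x 8 = 6 x 10^6 bytes (about 6 MB) of heap memory holding the matrix elements. We then multiply the matrices and time the run.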
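A minimal sketch of this allocation pattern, using two hypothetical helpers (alloc_matrix and free_matrix are illustrative names, not part of the program below):

```cpp
#include <cassert>

// Hypothetical helper: allocate an n x n matrix as n separately allocated rows.
// The array of n row pointers and every row live on the heap; only the
// double** value returned here is held in a local (stack) variable by the caller.
double **alloc_matrix(int n) {
    assert(n > 0);
    double **m = new double *[n];   // n row pointers, on the heap
    for (int i = 0; i < n; i++)
        m[i] = new double[n]();     // one row of n doubles, zero-initialized
    return m;
}

// Release the rows first, then the array of row pointers.
void free_matrix(double **m, int n) {
    for (int i = 0; i < n; i++)
        delete[] m[i];
    delete[] m;
}
```

Because every row is its own allocation, consecutive rows are not guaranteed to be adjacent in memory; where they end up is up to the allocator.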

The code is as follows:

#include<iostream>
using namespace std;

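#include<ctime>   // clock(), clock_t, CLOCKS_PER_SEC

// Matrix dimension. Declared const so that a[n][n] below is an ordinary
// fixed-size array rather than a compiler-specific variable-length array.
const int n=500;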

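// Multiply two n x n matrices held in local (stack) arrays and return one
// element of the product as a check value.
double stack_multiply(void){

    // Three n x n arrays of double: about 6 MB of stack for n = 500.
    double a[n][n], b[n][n], c[n][n];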
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            a[i][j] = i*j;
            b[i][j] = i*j;
        }
    }
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            double y=0;
            for(int k=0;k<n;k++){
                y += a[i][k] * b[k][j];
            }
            c[i][j] = y;
        }
    }
    return c[200][200];
}

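// Same triple loop, operating on matrices stored as heap-allocated rows
// reached through arrays of row pointers.
void heap_multiply(double **a, double **b, double **c){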

    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            double y=0;
            for(int k=0;k<n;k++){
                y += a[i][k] * b[k][j];
            }
            c[i][j] = y;
        }
    }
}

int main(){

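    // Time the stack version. Note that this timed region also covers the
    // initialization of a and b inside stack_multiply() and the print below.
    clock_t t_start = clock();
    int repeat = 1E0;   // number of repetitions to average over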
    double y;
    for(int i=0;i<repeat;i++)
        y = stack_multiply();
    cout<<"\t\tc[200][200]="<<y<<endl;
    clock_t t_end = clock();
    cout<<" It took me "<< (double)(t_end- t_start)/repeat/CLOCKS_PER_SEC<<"s to do the matrix multiplication in stack."<<endl;

    double **a=new double *[n];
    for(int i=0;i<n;i++){
        a[i] = new double [n];
        for(int j=0;j<n;j++){
            a[i][j] = i*j;
        }
    }

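    // Block count per dimension for the optional fragmentation experiment below,
    // which allocates piece^3 doubles in total. The runs reported later use
    // piece = 1E1 and 1E2; enabling it with 1E4 would request about 8 TB.
    int piece = 1E4;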
    /*
    double ***fragment = new double ** [piece];
    for(int i=0;i<piece;i++){
        fragment[i] = new double * [piece];
        for(int j=0;j<piece;j++){
            fragment[i][j] = new double [piece];
        }
    }
    */
    double **b=new double *[n];
    for(int i=0;i<n;i++){
        b[i] = new double [n];
        for(int j=0;j<n;j++){
            b[i][j] = i*j;
        }
    }

    double **c = new double * [n];
    for(int i=0;i<n;i++)
        c[i] = new double [n];
        

    // Time the heap version: only the multiplication is inside the timed region.
    t_start = clock();
    heap_multiply(a, b, c);
    t_end = clock();

    cout<<"\t\tc[200][200]="<<c[200][200]<<endl;

    for(int i=0;i<n;i++)
        delete [] a[i];
    delete [] a;
    
    /*
    for(int i=0;i<piece;i++){
        for(int j=0;j<piece;j++){
            delete []fragment[i][j];
        }
        delete [] fragment[i];
    }
    delete [] fragment;
    */
    
    for(int i=0;i<n;i++)
        delete [] b[i];
    delete [] b;
    for(int i=0;i<n;i++)
        delete [] c[i];
    delete [] c;

    cout<<" It took me "<< (double)(t_end- t_start)/CLOCKS_PER_SEC<<"s to do the matrix multiplication in heap."<<endl;

    return 0;
}

The fragment array in the code is there to simulate heap fragmentation.

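With the fragment part commented out (so the heap is not fragmented and matrix b is allocated right after matrix a), the two multiplications take about the same time, with the heap version even slightly faster: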

g++ main.cpp

./a.out

　　　　c[200][200]=1.66167e+12

It took me 0.71875s to do the matrix multiplication in stack

　　　　c[200][200]=1.66167e+12

It took me 0.671875s to do the matrix multiplication in heap

Note that compiling with -O2 cuts the runtime to somewhat more than 1/3 of the original (roughly 0.37x to 0.46x in the run below):

g++ main.cpp -O2

./a.out

　　　　c[200][200]=1.66167e+12

It took me 0.328125s to do the matrix multiplication in stack

　　　　c[200][200]=1.66167e+12

It took me 0.25s to do the matrix multiplication in heap

If fragment is enabled to simulate heap fragmentation, with piece = 1E1, the result is:

g++ main.cpp -O2

./a.out

　　　　c[200][200]=1.66167e+12

It took me 0.296875s to do the matrix multiplication in stack

　　　　c[200][200]=1.66167e+12

It took me 0.234375s to do the matrix multiplication in heap

Setting piece = 1E2 gives similar numbers:

g++ main.cpp -O2

./a.out

　　　　c[200][200]=1.66167e+12

It took me 0.28125s to do the matrix multiplication in stack

　　　　c[200][200]=1.66167e+12

It took me 0.203125s to do the matrix multiplication in heap

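So the two approaches appear to perform about the same. A likely reason is that memory is fetched in blocks: reading one element, whether it sits on the stack or on the heap, also brings its neighbors into the cache, so as long as the accesses do not jump around between scattered locations, the program stays fast. Note also that the stack timing includes initializing a and b and printing the result, whereas the heap timing covers only the multiplication, which may account for part of the heap version's small lead.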
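If memory locality is indeed what matters, one way to probe it is to reorder the loops so that the innermost loop walks along rows only. The following is a minimal sketch of that idea, not the code measured above: heap_multiply_ikj is a hypothetical variant of heap_multiply that takes the size as an extra parameter, and whether it is actually faster here would have to be measured.

```cpp
#include <cassert>

// Hypothetical variant of heap_multiply using i-k-j loop order.
// In the original i-j-k order, the inner loop reads b[k][j] for varying k,
// so each step moves to a different row of b. Here the inner loop varies j,
// so both b[k][j] and c[i][j] are read along a single row.
void heap_multiply_ikj(double **a, double **b, double **c, int n) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i][j] = 0.0;                 // clear the output row first
        for (int k = 0; k < n; k++) {
            const double aik = a[i][k];    // reused across the whole inner loop
            for (int j = 0; j < n; j++)
                c[i][j] += aik * b[k][j];  // accumulate row k of b into row i of c
        }
    }
}
```

The result is the same matrix product; only the order in which the partial sums are accumulated differs.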

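However, during testing there was one run with piece = 1E2 in which the second (heap) version became very slow, so the conclusion is not entirely certain.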
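Since a single run can be an outlier, one way to make such a comparison more robust is to repeat the timed region several times and look at the spread. The sketch below is an illustration rather than the method used for the numbers above: time_runs is a hypothetical helper, and it swaps clock() (CPU time) for std::chrono::steady_clock (wall-clock time).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: call `work` `repeats` times and return the duration
// of each call in seconds, sorted in ascending order.
template <typename F>
std::vector<double> time_runs(F work, int repeats) {
    assert(repeats > 0);
    std::vector<double> seconds;
    seconds.reserve(repeats);
    for (int r = 0; r < repeats; r++) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(seconds.begin(), seconds.end());
    return seconds;
}
```

For example, time_runs([&]{ heap_multiply(a, b, c); }, 10) would return ten timings; the first entry is the fastest run, the middle one approximates the median, and a last entry far above the rest would indicate an occasional slow run like the one described.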
