Why is my program slow when looping over exactly 8192 elements?

This article is a translation of: Why is my program slow when looping over exactly 8192 elements?

Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE, and is initialized with:

img[j][i] = 2 * j + i

Then, you make a matrix res[][], and each field in it is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++)
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++)
            for(l=-1;l<2;l++)
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

That's all there is to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048, e.g. these execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Also, knowing how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not the amount of memory used, it's merely execution time, so I don't know how that would help.


#1

Reference: https://stackoom.com/question/pSg6/为什么我的程序在完全循环-个元素时会变慢


#2

The difference is caused by the same super-alignment issue that comes up in other questions about matrices whose dimensions are large powers of two.

But that's only because there's one other problem with the code.

Starting from the original loop:

for(i=1;i<SIZE-1;i++)
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++)
            for(l=-1;l<2;l++)
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

So that leaves the two outer loops that we're interested in.

Now we can see the problem is the same as in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.
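As a rough illustration of why that hurts so much at SIZE = 8192, here is a minimal sketch, assuming 4-byte floats, 64-byte cache lines and a 32 KB, 8-way set-associative L1 cache (typical values, but an assumption, not something stated in the question): stepping down one column of img advances the address by SIZE * 4 = 32768 bytes, and every such step maps to the same cache set, so only a handful of column elements can live in the cache at once.

#include <stdio.h>

/* Sketch: which L1 cache set does each element of one img column map to?
 * The cache parameters are assumptions (32 KB, 8-way, 64-byte lines). */
int main(void)
{
    const int size        = 8192;                               /* SIZE    */
    const int line_bytes  = 64;
    const int cache_bytes = 32 * 1024;
    const int ways        = 8;
    const int sets        = cache_bytes / (ways * line_bytes);  /* 64 sets */

    long stride = (long)size * sizeof(float);   /* 32768 bytes per row step */

    for (int j = 0; j < 4; j++) {
        long addr = j * stride;        /* offset of img[j][i] from img[0][i] */
        long set  = (addr / line_bytes) % sets;
        printf("img[%d][i] -> cache set %ld\n", j, set);
    }
    /* Every element of the column maps to the same set, so at most `ways`
     * of them survive in L1 before they evict each other.  That conflict is
     * what makes the power-of-two SIZE so slow when iterating column-wise;
     * at SIZE = 8191 or 8193 the stride is not a power of two and the
     * accesses spread over many different sets. */
    return 0;
}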


To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates all the non-sequential access completely, so you no longer get random slow-downs when SIZE is a large power of two.


Core i7 920 @ 3.5 GHz

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged outer loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds
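For anyone who wants to reproduce numbers like these, a minimal timing harness along the following lines will do (a sketch, assuming a C99 compiler; the absolute times naturally depend on the machine and on optimization flags, and clock() measures CPU time rather than wall time):

#include <stdio.h>
#include <time.h>

#define SIZE 8192
static float img[SIZE][SIZE];   /* input image           */
static float res[SIZE][SIZE];   /* result of mean filter */

int main(void)
{
    int i, j, k, l;

    /* Initialization, as in the question. */
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            img[j][i] = (2 * j + i) % 8196;

    clock_t start = clock();

    /* Interchanged outer loops: res and img are traversed row by row. */
    for (j = 1; j < SIZE - 1; j++)
        for (i = 1; i < SIZE - 1; i++) {
            float sum = 0;
            for (l = -1; l < 2; l++)
                for (k = -1; k < 2; k++)
                    sum += img[j + l][i + k];
            res[j][i] = sum / 9;
        }

    clock_t end = clock();

    printf("SIZE = %d: %.3f seconds\n", SIZE,
           (double)(end - start) / CLOCKS_PER_SEC);
    printf("res[1][1] = %f\n", res[1][1]);   /* keep the result observable */
    return 0;
}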

#3

The following tests were done with the Visual C++ compiler as used by the default Qt Creator install (I guess with no optimization flag). With GCC, there is no big difference between Mystical's version and my "optimized" code, so the conclusion is that compiler optimizations take care of micro-optimization better than humans (me, at least). I leave the rest of my answer for reference.


It's not efficient to process images this way; it's better to use single-dimension arrays, so that all pixels are processed in one loop. Random access to a point can be done with:

pointer + (x + y*width)*(sizeOfOnePixel)

In this particular case, it's better to compute and cache the sums of horizontal groups of three pixels, because each of them is used three times.
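The fragments below are excerpts, so the buffers they use are never declared. They assume a setup roughly like this sketch; the pointer names and totalSize come from the fragments, while the flat buffers and the horizontal_sums() helper are assumptions added here to make the setup concrete:

#define SIZE 8192

/* Flat, row-major 1D buffers: pixel (x, y) lives at index x + y*SIZE. */
static float img1d[SIZE * SIZE];    /* input image                        */
static float hsum1d[SIZE * SIZE];   /* cached horizontal sums of 3 pixels */
static float res1d[SIZE * SIZE];    /* mean-filter result                 */

static float *imgPointer   = img1d;
static float *hsumPointer  = hsum1d;
static float *resPointer   = res1d;
static const int totalSize = SIZE * SIZE;

/* First pass of the two-pass variants: for each pixel (borders ignored for
 * simplicity), cache the sum of it and its left/right neighbours.  The
 * second pass, shown in the fragments below, then only needs to add three
 * of these cached sums per pixel and divide once by 9. */
static void horizontal_sums(void)
{
    for (int i = 1; i < totalSize - 1; i++)
        hsumPointer[i] = imgPointer[i - 1] + imgPointer[i] + imgPointer[i + 1];
}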

I've done some tests and I think it's worth sharing. Each result is an average of five tests.

Original code by user1615209:

8193: 4392 ms
8192: 9570 ms

Mystical's version:

8193: 2393 ms
8192: 2190 ms

Two passes using a 1D array: the first pass computes horizontal sums, the second the vertical sum and the average. Addressing is done with three pointers that are only incremented, like this:

imgPointer1 = &avg1[0][0];
imgPointer2 = &avg1[0][SIZE];
imgPointer3 = &avg1[0][SIZE+SIZE];

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(*(imgPointer1++)+*(imgPointer2++)+*(imgPointer3++))/9;
}

8193: 938 ms
8192: 974 ms

Two passes using a 1D array, with addressing like this:

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(hsumPointer[i-SIZE]+hsumPointer[i]+hsumPointer[i+SIZE])/9;
}

8193: 932 ms
8192: 925 ms

One pass, caching the horizontal sums just one row ahead so they stay in the cache:

// Horizontal sums for the first two lines
for(i=1;i<SIZE*2;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// Rest of the computation
for(;i<totalSize;i++){
    // Compute horizontal sum for next line
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
    // Final result
    resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}

8193: 599 ms
8192: 652 ms

Conclusion:

  • No benefit from using several pointers and just increments (I thought it would be faster).
  • Caching the horizontal sums is better than computing them several times.
  • Two passes is not three times faster, only twice as fast.
  • It's possible to get 3.6 times faster by using both a single pass and caching an intermediate result.

I'm sure it's possible to do much better.

NOTE: Please note that I wrote this answer to target general performance issues rather than the cache problem explained in Mystical's excellent answer. At the beginning it was just pseudo-code; I was asked in the comments to do tests... Here is a completely refactored version with tests.

Reposted from blog.csdn.net/p15097962069/article/details/105451470