CSAPP Experiment 4: Performance Optimization Experiment (Perflab)

       This series of articles covers the labs assigned in "Computer Systems", a foundational course for computer science majors at the University of Science and Technology of China. The textbook and course content follow CSAPP (the "black book"). The labs took me a lot of effort and many detours at the time, so I am now writing up a summary of each one. This article covers the fourth lab: the Performance Optimization Experiment (Perflab).

1. Experiment name: perflab

2. Experimental hours: 3

3. Experiment content and purpose:

       This experiment optimizes image processing code. Image processing offers many functions that benefit from optimization. In this lab we consider two image processing operations: rotate, which rotates the image 90° counterclockwise, and smooth, which "smooths" or "blurs" the image.

       For this experiment, an image is represented by a two-dimensional matrix M, and the pixel value at position (i, j) is denoted M[i][j]. A pixel value is a triplet of red, green, and blue (RGB) values. We only consider square images. Let N denote the number of rows (and columns) of the image. Rows and columns are numbered in C style, from 0 to N - 1.

       Under this representation, the rotate operation can be implemented simply by combining the following two matrix operations:

  • Transpose: For each (i, j), swap M[i][j] with M[j][i].
  • Row Swap: Swap row i with row N - 1 - i.
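       The two steps above can be sketched in C as follows. This is only an illustration under simplifying assumptions: a plain int stands in for an RGB pixel, the rotation is done in place, and the names IDX, transpose, swap_rows and rotate_ccw_in_place are hypothetical helpers that do not appear in the lab code.

```c
#include <assert.h>

/* Hypothetical helper names; a plain int stands in for an RGB pixel. */
#define IDX(i, j, n) ((i) * (n) + (j))

/* Step 1: transpose in place by swapping M[i][j] with M[j][i]. */
static void transpose(int n, int *m)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int t = m[IDX(i, j, n)];
            m[IDX(i, j, n)] = m[IDX(j, i, n)];
            m[IDX(j, i, n)] = t;
        }
}

/* Step 2: exchange row i with row n - 1 - i. */
static void swap_rows(int n, int *m)
{
    for (int i = 0; i < n / 2; i++)
        for (int j = 0; j < n; j++) {
            int t = m[IDX(i, j, n)];
            m[IDX(i, j, n)] = m[IDX(n - 1 - i, j, n)];
            m[IDX(n - 1 - i, j, n)] = t;
        }
}

/* Rotate an n x n matrix 90 degrees counterclockwise in place. */
void rotate_ccw_in_place(int n, int *m)
{
    assert(n > 0);
    transpose(n, m);
    swap_rows(n, m);
}
```

       For a 3×3 matrix with rows (1 2 3), (4 5 6), (7 8 9), the transpose gives rows (1 4 7), (2 5 8), (3 6 9), and the row swap then gives (3 6 9), (2 5 8), (1 4 7): the original right-hand column has become the top row.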

       See the picture below for details:

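        The smooth operation replaces each pixel with the mean of that pixel and its surrounding pixels (a window of at most 3×3 centered on the pixel). See Figure 2 for details. Writing M1 for the source image and M2 for the smoothed image, the pixels M2[1][1] and M2[N - 1][N - 1] are given by:

        M2[1][1] = (sum of M1[i][j] for i = 0..2 and j = 0..2) / 9

        M2[N - 1][N - 1] = (sum of M1[i][j] for i = N - 2..N - 1 and j = N - 2..N - 1) / 4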
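        The definition of smooth can be made concrete with a small sketch. It is not the lab's avg function; the type rgb_t and the function smooth_at are hypothetical names chosen for this illustration, and the channel type is an assumption. The sketch averages every pixel of the 3×3 window that actually lies inside the image.

```c
#include <assert.h>

/* Hypothetical stand-in for the lab's pixel type (three color channels). */
typedef struct {
    unsigned short red, green, blue;
} rgb_t;

/* Mean of the pixels in the 3x3 window centered on (i, j), clipped to the
 * n x n image. For n >= 3 a corner averages 4 pixels, an edge 6, and an
 * interior pixel 9. */
rgb_t smooth_at(int n, int i, int j, const rgb_t *src)
{
    assert(n > 0 && i >= 0 && i < n && j >= 0 && j < n);
    int row_lo = i > 0 ? i - 1 : 0;
    int row_hi = i < n - 1 ? i + 1 : n - 1;
    int col_lo = j > 0 ? j - 1 : 0;
    int col_hi = j < n - 1 ? j + 1 : n - 1;
    int r = 0, g = 0, b = 0, count = 0;

    for (int ii = row_lo; ii <= row_hi; ii++)
        for (int jj = col_lo; jj <= col_hi; jj++) {
            const rgb_t *p = &src[ii * n + jj];
            r += p->red;
            g += p->green;
            b += p->blue;
            count++;
        }

    rgb_t out;
    out.red = (unsigned short)(r / count);
    out.green = (unsigned short)(g / count);
    out.blue = (unsigned short)(b / count);
    return out;
}
```

        Calling smooth_at for every (i, j) reproduces the straightforward per-pixel approach; the bounds checks in the window computation are exactly the work that an optimized version tries to avoid.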

 4. Experimental steps and results:

        First, copy perflab-handout.tar to a protected folder for this experiment.

        Then extract the archive at the command line with tar xvf perflab-handout.tar, which places several files in the current directory.

        The only file you may modify and finally submit is kernels.c. The driver.c program is a driver that you can use to evaluate the performance of your solution. Use the command make driver to build the driver, and the command ./driver to run it.

        Looking at the file kernels.c, you will find a C struct: team. You will need to fill in the information of your group members. Please fill it out right away in case you forget.

      1. naive_rotate

/*
 * naive_rotate - The naive baseline version of rotate
 */
char naive_rotate_descr[] = "naive_rotate: Naive baseline implementation";
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

         The RIDX macro is defined in the header file defs.h:

#define RIDX(i,j,n) ((i)*(n)+(j))

        With this macro, the code rotates the picture in a straightforward way: it copies the pixel at row i, column j of src to row dim-1-j, column i of dst, which amounts to a 90 degree counterclockwise rotation of the entire picture.
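        To see the index mapping at work, here is a minimal sketch that applies the same loop to a small array of int values instead of pixels. The functions rotate_ints and dst_stride are hypothetical helpers written for this illustration; only RIDX comes from the lab. dst_stride reports how far apart two consecutive writes of the inner loop land in dst.

```c
#include <assert.h>

#define RIDX(i, j, n) ((i) * (n) + (j))

/* Same index mapping as naive_rotate; int stands in for the pixel type. */
void rotate_ints(int dim, const int *src, int *dst)
{
    assert(dim > 0);
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Distance, in elements, between the dst slots written by two consecutive
 * inner-loop iterations (j and j + 1) for a fixed i. */
int dst_stride(int dim, int i, int j)
{
    return RIDX(dim - 1 - (j + 1), i, dim) - RIDX(dim - 1 - j, i, dim);
}
```

        The reads from src advance by one element per inner iteration, but dst_stride always evaluates to -dim: each write lands one full row away from the previous one.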

        However, the inner loop of this code writes dst with a stride of dim pixels (one full row per iteration), so the cache hit rate is very low and the overall efficiency is poor. Taking the cache size into account, the image should instead be processed in blocks: for each column of a block, 32 source pixels are read and stored to 32 consecutive destination addresses. (The block size of 32 is chosen to make full use of the 32 KB first-level cache.)

        This access pattern is cache friendly and can greatly improve efficiency.
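        Before any loop unrolling, the blocking idea can be sketched with an ordinary inner loop. This is an illustration, not the submitted solution: rotate_blocked and BLOCK are hypothetical names, the pixel struct is a stand-in whose layout is an assumption, and the sketch assumes dim is a multiple of 32.

```c
#include <assert.h>

#define RIDX(i, j, n) ((i) * (n) + (j))
#define BLOCK 32 /* rows per strip; dim is assumed to be a multiple of 32 */

/* Stand-in for the lab's pixel type; the exact layout is an assumption. */
typedef struct {
    unsigned short red, green, blue;
} pixel;

void rotate_blocked(int dim, const pixel *src, pixel *dst)
{
    assert(dim > 0 && dim % BLOCK == 0);
    for (int i = 0; i < dim; i += BLOCK)        /* one strip of 32 source rows */
        for (int j = 0; j < dim; j++)           /* one source column at a time */
            for (int k = i; k < i + BLOCK; k++) /* 32 consecutive dst writes */
                dst[RIDX(dim - 1 - j, k, dim)] = src[RIDX(k, j, dim)];
}
```

        For a fixed j, the innermost loop writes dst[RIDX(dim - 1 - j, i, dim)] through dst[RIDX(dim - 1 - j, i + 31, dim)], which are adjacent in memory. The reads from src still jump by dim elements, but the next value of j touches the neighbouring pixels of those same 32 source rows, so they have a better chance of still being cached.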

      2. perf_rotate

        Use a 32×1 rectangular block with 32-way loop unrolling, and make the destination addresses contiguous so that the writes miss the cache less often.

// Define a copy macro to make the code easier to write
#define COPY(d,s) (*(d)=*(s))
char rotate_descr[] = "rotate: Current working version";

void rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i += 32)  // 32-way loop unrolling: 32 pixels are stored one after another
        for (j = dim-1; j >= 0; j -= 1)
        {
            pixel *dptr = dst + RIDX(dim-1-j, i, dim);
            pixel *sptr = src + RIDX(i, j, dim);
            // Copy 32 pixels from one source column into 32 consecutive destination slots
            COPY(dptr, sptr); sptr += dim;
            COPY(dptr+1, sptr); sptr += dim;
            COPY(dptr+2, sptr); sptr += dim;
            COPY(dptr+3, sptr); sptr += dim;
            COPY(dptr+4, sptr); sptr += dim;
            COPY(dptr+5, sptr); sptr += dim;
            COPY(dptr+6, sptr); sptr += dim;
            COPY(dptr+7, sptr); sptr += dim;
            COPY(dptr+8, sptr); sptr += dim;
            COPY(dptr+9, sptr); sptr += dim;
            COPY(dptr+10, sptr); sptr += dim;
            COPY(dptr+11, sptr); sptr += dim;
            COPY(dptr+12, sptr); sptr += dim;
            COPY(dptr+13, sptr); sptr += dim;
            COPY(dptr+14, sptr); sptr += dim;
            COPY(dptr+15, sptr); sptr += dim;
            COPY(dptr+16, sptr); sptr += dim;
            COPY(dptr+17, sptr); sptr += dim;
            COPY(dptr+18, sptr); sptr += dim;
            COPY(dptr+19, sptr); sptr += dim;
            COPY(dptr+20, sptr); sptr += dim;
            COPY(dptr+21, sptr); sptr += dim;
            COPY(dptr+22, sptr); sptr += dim;
            COPY(dptr+23, sptr); sptr += dim;
            COPY(dptr+24, sptr); sptr += dim;
            COPY(dptr+25, sptr); sptr += dim;
            COPY(dptr+26, sptr); sptr += dim;
            COPY(dptr+27, sptr); sptr += dim;
            COPY(dptr+28, sptr); sptr += dim;
            COPY(dptr+29, sptr); sptr += dim;
            COPY(dptr+30, sptr); sptr += dim;
            COPY(dptr+31, sptr);
        }
}

      3. naive_smooth

char naive_smooth_descr[] = "naive_smooth: Naive baseline implementation";
void naive_smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}

        This code calls the avg function once per pixel inside a two-level for loop, and avg in turn calls initialize_pixel_sum, accumulate_sum and assign_sum_to_pixel, so a large share of the running time goes to function call overhead. To reduce that overhead, the code needs to be rewritten so that it does not call avg at all.

        Special cases get special treatment:

        The smooth function handles three kinds of regions: the interior of the image, where each pixel is the average of 9 points; the 4 corners, where each pixel is the average of 4 points; and the four edges, where each pixel is the average of 6 points. Processing starts at the top-left corner of the picture, continues along the top edge, and then proceeds in memory order. Each middle row is handled in one pass of the outer for loop: the left edge pixel, then the interior of that row, then the right edge pixel. This leads to the following optimized code.
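        The case split above can be checked with a tiny helper that returns how many pixels contribute to the average at a given position. This is only a sketch: window_size is a hypothetical name, and it assumes dim is at least 2 so that every corner really has four neighbours in range.

```c
#include <assert.h>

/* Number of pixels averaged at (i, j) in a dim x dim image:
 * 4 at a corner, 6 on an edge, 9 in the interior. */
int window_size(int dim, int i, int j)
{
    assert(dim >= 2 && i >= 0 && i < dim && j >= 0 && j < dim);
    int rows = (i == 0 || i == dim - 1) ? 2 : 3; /* rows inside the image */
    int cols = (j == 0 || j == dim - 1) ? 2 : 3; /* columns inside the image */
    return rows * cols;
}
```

        Because the divisor is a known constant in each region, the optimized code can write / 9, / 6 or >> 2 directly instead of counting neighbours at run time; for the non-negative channel sums used here, a right shift by 2 gives the same result as dividing by 4.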

      4. perf_smooth

char smooth_descr[] = "smooth: Current working version";
void smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;
    int dim0 = dim;
    int dim1 = dim - 1;
    int dim2 = dim - 2;
    pixel *P1, *P2, *P3;
    pixel *dst1;
    P1 = src;
    P2 = P1 + dim0;
    // Top-left corner: a shift replaces the division done in avg
    dst->red = (P1->red + (P1 + 1)->red + P2->red + (P2 + 1)->red) >> 2;
    dst->green = (P1->green + (P1 + 1)->green + P2->green + (P2 + 1)->green) >> 2;
    dst->blue = (P1->blue + (P1 + 1)->blue + P2->blue + (P2 + 1)->blue) >> 2;
    dst++;
    // Top edge
    for (i = 1; i < dim1; i++) {
        dst->red = (P1->red + (P1 + 1)->red + (P1 + 2)->red + P2->red + (P2 + 1)->red + (P2 + 2)->red) / 6;
        dst->green = (P1->green + (P1 + 1)->green + (P1 + 2)->green + P2->green + (P2 + 1)->green + (P2 + 2)->green) / 6;
        dst->blue = (P1->blue + (P1 + 1)->blue + (P1 + 2)->blue + P2->blue + (P2 + 1)->blue + (P2 + 2)->blue) / 6;
        dst++;
        P1++;
        P2++;
    }
    // Top-right corner
    dst->red = (P1->red + (P1 + 1)->red + P2->red + (P2 + 1)->red) >> 2;
    dst->green = (P1->green + (P1 + 1)->green + P2->green + (P2 + 1)->green) >> 2;
    dst->blue = (P1->blue + (P1 + 1)->blue + P2->blue + (P2 + 1)->blue) >> 2;
    dst++;
    P1 = src;
    P2 = P1 + dim0;
    P3 = P2 + dim0;
    // Middle rows: left edge, interior, then right edge of each row
    for (i = 1; i < dim1; i++) {
        // Left edge
        dst->red = (P1->red + (P1 + 1)->red + P2->red + (P2 + 1)->red + P3->red + (P3 + 1)->red) / 6;
        dst->green = (P1->green + (P1 + 1)->green + P2->green + (P2 + 1)->green + P3->green + (P3 + 1)->green) / 6;
        dst->blue = (P1->blue + (P1 + 1)->blue + P2->blue + (P2 + 1)->blue + P3->blue + (P3 + 1)->blue) / 6;
        dst++;
        dst1 = dst + 1;
        // Interior: process 2 pixels per iteration
        for (j = 1; j < dim2; j += 2) {
            dst->red = (P1->red + (P1 + 1)->red + (P1 + 2)->red + P2->red + (P2 + 1)->red + (P2 + 2)->red + P3->red + (P3 + 1)->red + (P3 + 2)->red) / 9;
            dst->green = (P1->green + (P1 + 1)->green + (P1 + 2)->green + P2->green + (P2 + 1)->green + (P2 + 2)->green + P3->green + (P3 + 1)->green + (P3 + 2)->green) / 9;
            dst->blue = (P1->blue + (P1 + 1)->blue + (P1 + 2)->blue + P2->blue + (P2 + 1)->blue + (P2 + 2)->blue + P3->blue + (P3 + 1)->blue + (P3 + 2)->blue) / 9;
            dst1->red = ((P1 + 3)->red + (P1 + 1)->red + (P1 + 2)->red + (P2 + 3)->red + (P2 + 1)->red + (P2 + 2)->red + (P3 + 3)->red + (P3 + 1)->red + (P3 + 2)->red) / 9;
            dst1->green = ((P1 + 3)->green + (P1 + 1)->green + (P1 + 2)->green + (P2 + 3)->green + (P2 + 1)->green + (P2 + 2)->green + (P3 + 3)->green + (P3 + 1)->green + (P3 + 2)->green) / 9;
            dst1->blue = ((P1 + 3)->blue + (P1 + 1)->blue + (P1 + 2)->blue + (P2 + 3)->blue + (P2 + 1)->blue + (P2 + 2)->blue + (P3 + 3)->blue + (P3 + 1)->blue + (P3 + 2)->blue) / 9;
            dst += 2; dst1 += 2; P1 += 2; P2 += 2; P3 += 2;
        }
        // Leftover interior pixel, if the interior width is odd
        for (; j < dim1; j++) {
            dst->red = (P1->red + (P1 + 1)->red + (P1 + 2)->red + P2->red + (P2 + 1)->red + (P2 + 2)->red + P3->red + (P3 + 1)->red + (P3 + 2)->red) / 9;
            dst->green = (P1->green + (P1 + 1)->green + (P1 + 2)->green + P2->green + (P2 + 1)->green + (P2 + 2)->green + P3->green + (P3 + 1)->green + (P3 + 2)->green) / 9;
            dst->blue = (P1->blue + (P1 + 1)->blue + (P1 + 2)->blue + P2->blue + (P2 + 1)->blue + (P2 + 2)->blue + P3->blue + (P3 + 1)->blue + (P3 + 2)->blue) / 9;
            dst++; P1++; P2++; P3++;
        }
        // Right edge
        dst->red = (P1->red + (P1 + 1)->red + P2->red + (P2 + 1)->red + P3->red + (P3 + 1)->red) / 6;
        dst->green = (P1->green + (P1 + 1)->green + P2->green + (P2 + 1)->green + P3->green + (P3 + 1)->green) / 6;
        dst->blue = (P1->blue + (P1 + 1)->blue + P2->blue + (P2 + 1)->blue + P3->blue + (P3 + 1)->blue) / 6;
        // Move all three row pointers to the start of the next row
        dst++; P1 += 2; P2 += 2; P3 += 2;
    }
    // Bottom-left corner
    dst->red = (P1->red + (P1 + 1)->red + P2->red + (P2 + 1)->red) >> 2;
    dst->green = (P1->green + (P1 + 1)->green + P2->green + (P2 + 1)->green) >> 2;
    dst->blue = (P1->blue + (P1 + 1)->blue + P2->blue + (P2 + 1)->blue) >> 2;
    dst++;
    // Bottom edge
    for (i = 1; i < dim1; i++) {
        dst->red = (P1->red + (P1 + 1)->red + (P1 + 2)->red + P2->red + (P2 + 1)->red + (P2 + 2)->red) / 6;
        dst->green = (P1->green + (P1 + 1)->green + (P1 + 2)->green + P2->green + (P2 + 1)->green + (P2 + 2)->green) / 6;
        dst->blue = (P1->blue + (P1 + 1)->blue + (P1 + 2)->blue + P2->blue + (P2 + 1)->blue + (P2 + 2)->blue) / 6;
        dst++; P1++; P2++;
    }
    // Bottom-right corner
    dst->red = (P1->red + (P1 + 1)->red + P2->red + (P2 + 1)->red) >> 2;
    dst->green = (P1->green + (P1 + 1)->green + P2->green + (P2 + 1)->green) >> 2;
    dst->blue = (P1->blue + (P1 + 1)->blue + P2->blue + (P2 + 1)->blue) >> 2;
    // All pixels processed
}

      5. Experimental Results

 Average speedup of 13.1x for rotate operations

 Average speedup of 48.4x for smooth operations

Origin blog.csdn.net/qq_35739903/article/details/119653717