NVIDIA CUDA Programming Massively Parallel Processors (9): Parallel Patterns: Sparse Matrix-Vector Multiplication


An introduction to compaction and regularization in parallel algorithms

Background

A sparse matrix is one in which most of the elements are 0. The figure below is a simple example.

[Figure: an example sparse matrix]

When a sparse matrix is stored in the compressed sparse row (CSR) format, the 0 elements are not stored. The array data[] holds the non-zero elements, and two auxiliary arrays preserve the structure of the original matrix. The first is the column index array col_index[], which records the column of each non-zero element in the original matrix. The second, row_ptr[], records where the non-zero elements of each row begin in data[]. It has one more element than the matrix has rows; the extra last element marks the end of the final row, i.e. the total number of non-zero elements. The CSR layout of the matrix above is shown in the following figure:

[Figure: CSR representation — data[], col_index[], and row_ptr[]]
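To make the layout concrete, here is a small hypothetical example (the original figure is not reproduced, so this matrix is an illustration, not necessarily the one from the figure):

// A hypothetical 4x4 sparse matrix (rows shown top to bottom):
//     3 0 1 0
//     0 0 0 0
//     0 2 4 1
//     1 0 0 1
float data[]      = {3, 1, 2, 4, 1, 1, 1};  // non-zero values, row by row
int   col_index[] = {0, 2, 1, 2, 3, 0, 3};  // column of each value above
int   row_ptr[]   = {0, 2, 2, 5, 7};        // where each row starts in data[]
// Row 1 is empty, so row_ptr[1] == row_ptr[2]; the final 7 marks the end.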
As discussed in (5), matrices are commonly used to solve linear systems of the form A × X + Y = 0, where A is an N × N matrix, X is an N-element unknown vector, and Y is an N-element constant vector. The direct solution is X = A⁻¹ × (−Y); Gaussian elimination is another option. With a sparse matrix, both methods become impractical: inversion typically introduces many new non-zero elements, so the inverse of a sparse matrix can be very large.

Sparse linear systems are generally better solved with iterative methods. If the sparse matrix A is positive definite (that is, xᵀAx > 0 for every non-zero vector x in Rⁿ), the conjugate gradient method converges to a solution. The process is: initialize a guess X and compute A × X + Y; if the result is not close to 0, adjust X using the gradient vector formula and use the adjusted X in the next iteration. The result of each iteration feeds into the next round of the computation.
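A schematic of that loop, not a full conjugate gradient implementation, might look like the following; norm() and update_x() stand in for the residual test and the gradient-based adjustment, and spmv_csr_serial() is assumed to wrap the serial loop shown below:

// Schematic solver loop (a sketch, not a complete conjugate gradient solver).
// spmv_csr_serial() accumulates A*X into r, so r is first seeded with Y;
// norm() and update_x() are assumed helper routines. memcpy is from <string.h>.
for (int iter = 0; iter < max_iters; iter++) {
    memcpy(r, y, num_rows * sizeof(float));                    // r = Y
    spmv_csr_serial(num_rows, data, col_index, row_ptr, x, r); // r += A*X
    if (norm(r, num_rows) < tolerance) break;                  // close to 0: done
    update_x(x, r, num_rows);                                  // adjust X and retry
}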

The serial implementation of A × X + Y is as follows:

for (int row = 0; row < num_rows; row++) {
	float dot = 0;
	// row_start and row_end delimit the range of data[] elements belonging to this row
	int row_start = row_ptr[row], row_end = row_ptr[row + 1];
	for (int elem = row_start; elem < row_end; elem++) {
		/* elem indexes both the matrix element data[elem] and its column index
		   col_index[elem]; the column index selects the matching element of x. */
		dot += data[elem] * x[col_index[elem]];
	}
	y[row] += dot;
}

Parallel SpMV using CSR format

The dot product for each row of the sparse matrix is independent of every other row. Assigning each iteration of the outer loop above to one thread yields the SpMV/CSR kernel:

__global__ void SpMV_CSR(int num_rows, float *data, int *col_index, int *row_ptr, float *x, float *y)
{
    // One thread handles one row of the matrix.
    int row = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < num_rows)
    {
        int row_start = row_ptr[row], row_end = row_ptr[row + 1];
        float dot = 0;
        for (int elem = row_start; elem < row_end; elem++)
        {
            dot += data[elem] * x[col_index[elem]];
        }
        y[row] += dot;
    }
}
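A typical launch, assuming the arrays have already been copied to the device (the d_ prefixes are illustrative names, not from the original):

int threads_per_block = 256;  // any multiple of the warp size is reasonable
int blocks = (num_rows + threads_per_block - 1) / threads_per_block;  // one thread per row
SpMV_CSR<<<blocks, threads_per_block>>>(num_rows, d_data, d_col_index, d_row_ptr, d_x, d_y);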

This kernel closely mirrors the serial code and is very simple, but it has two shortcomings:

  1. Memory accesses are not coalesced, so the device's memory bandwidth is used inefficiently: neighboring threads read non-adjacent elements of data[].
  2. All warps suffer control-flow divergence. The number of loop iterations a thread performs equals the number of non-zero elements in its row, which varies from row to row, so threads in the same warp can take different paths.

Padding and transposition

Both problems, non-coalesced memory accesses and control-flow divergence, can be addressed with data padding and matrix transposition, which is the idea behind the ELL storage format. Starting from the CSR representation, first find the row with the most non-zero elements, then append 0 elements to every other row until all rows are as long as the longest one. The left side of the figure below shows the example sparse matrix from above after padding with 0s.

The padded matrix is then transposed: each stored row now holds what was a column before the transposition, so on each iteration all threads access adjacent memory locations and accesses are coalesced. And because every thread now loops over the same number of elements, there is no control-flow divergence either.

[Figure: padding data[] with 0s and transposing it into column-major order]
col_index must be padded and transposed in the same way:
[Figure: the padded and transposed col_index[]]
With this layout the SpMV/ELL kernel can be written as:

__global__ void SpMV_ELL(int num_rows, int total_elems, float *data, int *col_index, float *x, float *y){
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if(row < num_rows){
        float dot = 0;
        // Column-major layout: consecutive elements of this row are num_rows apart.
        for(int i = row; i < total_elems; i += num_rows){
            dot += data[i] * x[col_index[i]];
        }
        y[row] += dot;
    }
}

Here num_rows is the number of rows in the original matrix, and total_elems is the number of elements in the padded matrix (num_rows times the length of the longest row).
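The padding and transposition can be done on the host before the kernel launch. A minimal sketch, assuming the CSR arrays from before (padded column indices are left at 0, which is safe because the corresponding data values are 0):

#include <stdlib.h>

/* Host-side CSR -> ELL conversion (a sketch; error checking omitted).
   Produces column-major ("transposed") arrays so that consecutive rows
   are adjacent in memory, as SpMV_ELL expects. */
void csr_to_ell(int num_rows, const float *data, const int *col_index,
                const int *row_ptr, float **ell_data, int **ell_index,
                int *total_elems)
{
    int max_len = 0;  // number of non-zeros in the longest row
    for (int row = 0; row < num_rows; row++) {
        int len = row_ptr[row + 1] - row_ptr[row];
        if (len > max_len) max_len = len;
    }
    *total_elems = num_rows * max_len;
    *ell_data  = calloc(*total_elems, sizeof(float));  // padding stays 0
    *ell_index = calloc(*total_elems, sizeof(int));    // padded indices: 0
    for (int row = 0; row < num_rows; row++)
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++) {
            int i = row + (j - row_ptr[row]) * num_rows;  // column-major slot
            (*ell_data)[i]  = data[j];
            (*ell_index)[i] = col_index[j];
        }
}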

Use a hybrid approach to control padding

Excessive padding in the ELL format arises when one or a few rows contain far more non-zero elements than the rest. The number of padded elements can be reduced by "taking away" some of the elements of those rows; the coordinate (COO) format can be used for this purpose.
As shown in the figure below, the COO format stores each non-zero element of data[] together with both its column index (col_index) and its row index (row_index).
[Figure: COO representation — data[], col_index[], and row_index[]]
The hybrid method takes some elements out of the rows with a large number of non-zero elements and stores them in COO format. The remaining elements go through SpMV in CSR or ELL format, and the elements stored in COO are then handled with SpMV/COO, as sketched below.
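A sketch of the split on the host (the threshold max_len, here a chosen row-length cap rather than the longest row, and the pre-allocated coo_* arrays are assumptions): every element past the first max_len in a row is appended to the COO arrays, after which the truncated rows can be packed into ELL as before.

// Move the "excess" elements of long rows into COO arrays (a sketch).
// coo_data, coo_row_index, and coo_col_index are assumed pre-allocated
// large enough; num_coo counts how many elements were moved.
int num_coo = 0;
for (int row = 0; row < num_rows; row++) {
    for (int j = row_ptr[row] + max_len; j < row_ptr[row + 1]; j++) {
        coo_data[num_coo]      = data[j];
        coo_row_index[num_coo] = row;
        coo_col_index[num_coo] = col_index[j];
        num_coo++;
    }
}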

One option is to run SpMV/ELL on the device, copy the result back to the host, and then run a serial SpMV/COO that adds the leftover elements into y to complete the result. The serial COO pass benefits from the CPU's large caches.
Serial COO:

// num_elem is the number of elements stored in COO format
for (int i = 0; i < num_elem; i++)
	y[row_index[i]] += data[i] * x[col_index[i]];

Alternatively, the COO part can be computed directly on the device, using atomic additions to accumulate into y.
Kernel COO:

__global__ void COO_kernel(int num_elem, float *y, float *data, int *row_index, int *col_index, float *x){
	// One thread per COO element; atomicAdd prevents races between threads
	// whose elements fall in the same row.
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	if(i < num_elem){
		atomicAdd(&(y[row_index[i]]), data[i] * x[col_index[i]]);
	}
}
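Putting the device-side pieces together might look like this (the d_* names are illustrative; y must already hold its initial values, and both kernels accumulate into it):

int threads = 256;
// ELL part: one thread per row
SpMV_ELL<<<(num_rows + threads - 1) / threads, threads>>>(
        num_rows, total_elems, d_ell_data, d_ell_index, d_x, d_y);
// COO part: one thread per leftover element, accumulating with atomicAdd
COO_kernel<<<(num_coo + threads - 1) / threads, threads>>>(
        num_coo, d_y, d_coo_data, d_coo_row_index, d_coo_col_index, d_x);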

Regularize by sorting and partitioning

I ran out of steam at this section (can't absorb any more for now); I'll pick it up again later.

Source: blog.csdn.net/weixin_45773137/article/details/125572815