Convert Tensor data to sparse matrix

1. Sparse matrix

Commonly used sparse matrix storage formats include COO, CSR/CSC, and LIL.

1.COO

COO (Coordinate format) is the simplest format: it stores a sparse matrix as a list of triples, recording each non-zero element's value together with its row and column numbers, i.e. (row number, column number, value). The main advantages of this storage method are flexibility and simplicity, but the disadvantage is that matrix operations cannot be performed on it directly.

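As a minimal sketch of this triple form (using scipy.sparse, which also appears later in this article), the 3×4 matrix below is fully described by its row indices, column indices and values:

import scipy.sparse

# (row, column, value) triples for a 3x4 matrix with three non-zero entries
rows = [0, 1, 2]
cols = [3, 2, 1]
vals = [5, 3, 2]

coo = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(3, 4))
print(coo.toarray())
# [[0 0 0 5]
#  [0 0 3 0]
#  [0 2 0 0]]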

2.CSR/CSC

CSR (Compressed Sparse Row) stores a two-dimensional matrix using a compressed row-pointer array, a column-index array and a value array. In PyTorch this layout only supports two-dimensional tensors, but its main advantages over the COO format are better storage utilization and faster computational operations. At the time of writing, CUDA support was not yet available.

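As a rough sketch of this layout, PyTorch exposes it through torch.sparse_csr_tensor, which takes a row-pointer array, column indices and values:

import torch

# CSR storage of the 3x4 matrix
# [[0, 0, 0, 5],
#  [0, 0, 3, 0],
#  [0, 2, 0, 0]]
crow_indices = torch.tensor([0, 1, 2, 3])  # row i owns values[crow[i]:crow[i+1]]
col_indices = torch.tensor([3, 2, 1])      # column of each stored value
values = torch.tensor([5, 3, 2])

csr = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(3, 4))
print(csr.to_dense())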

3.LIL

LIL (List-of-Lists) stores one list per row; each entry in a row's list contains a column index and the corresponding value. These entries are typically kept sorted by column index for faster lookups.
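
scipy.sparse provides this layout as lil_matrix; a small sketch (the rows and data attribute names are scipy's own):

import scipy.sparse

lil = scipy.sparse.lil_matrix((3, 4))
lil[0, 3] = 5
lil[1, 2] = 3
lil[2, 1] = 2

print(lil.rows)  # per-row lists of column indices, e.g. [list([3]) list([2]) list([1])]
print(lil.data)  # per-row lists of stored values, e.g. [list([5.0]) list([3.0]) list([2.0])]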

4. Processing of sparse matrices

In PyTorch, torch.sparse is an effective tool for working with sparse matrices. PyTorch supports sparse tensors in the COO(rdinate) format, which can efficiently store and process tensors in which most elements are zero.
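
As a quick illustration that sparse tensors take part in ordinary computations (a sketch; torch.sparse.mm multiplies a sparse COO matrix by a dense one):

import torch

indices = torch.tensor([[0, 1, 2],
                        [3, 2, 1]])
values = torch.tensor([5.0, 3.0, 2.0])
s = torch.sparse_coo_tensor(indices, values, (3, 4))

dense = torch.ones(4, 2)
print(torch.sparse.mm(s, dense))  # (3, 4) sparse @ (4, 2) dense -> (3, 2) dense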

2. Convert Tensor data to sparse matrix

1. torch.sparse_coo_tensor
torch.sparse_coo_tensor(indices, values, size=None, *, dtype=None, requires_grad=False) -> Tensor

Parameters:

  • indices: A 2D tensor in which each column holds the coordinates of one non-zero element.
  • values: A 1D tensor in which the i-th entry is the value of the non-zero element at the coordinates given by the i-th column of indices.
  • size: (optional) a tuple giving the overall size of the sparse tensor.

Assume the following 2D tensor:

0 0 0 5
0 0 3 0
0 2 0 0

The non-zero elements sit at coordinates (0, 3), (1, 2) and (2, 1), with values 5, 3 and 2. In COO format the indices tensor stacks the row numbers in its first row and the column numbers in its second row:

indices = [[0, 1, 2],
           [3, 2, 1]]

values = [5, 3, 2]

This sparse tensor can be created using torch.sparse_coo_tensor:

import torch

indices = torch.tensor([[0, 1, 2],
                        [3, 2, 1]])
values = torch.tensor([5, 3, 2])
size = (3, 4)

sparse_tensor = torch.sparse_coo_tensor(indices, values, size)
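
To check the result, a quick sketch: print the sparse tensor and convert it back with to_dense(), which should reproduce the original 3×4 matrix.

print(sparse_tensor)
print(sparse_tensor.to_dense())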
2. Convert 2D Tensor data into a COO sparse tensor
import torch

# Example tensor
tensor = torch.tensor([[0, 2, 1], [0, 0, 3], [4, 0, 0]])

# Find the indices of the non-zero elements
non_zero_indices = torch.nonzero(tensor).t()
print(non_zero_indices[0])  # tensor([0, 0, 1, 2])
print(non_zero_indices[1])  # tensor([1, 2, 2, 0])

print(tensor.dim())  # 2

# Get the values of the non-zero elements
values = tensor[tuple(non_zero_indices[i] for i in range(tensor.dim()))]

print(values)  # tensor([2, 1, 3, 4])

print(tensor.size())  # torch.Size([3, 3])

# Create the sparse tensor
sparse_tensor = torch.sparse_coo_tensor(non_zero_indices, values, tensor.size())

print(sparse_tensor)

The resulting sparse tensor:

tensor(indices=tensor([[0, 0, 1, 2],
                       [1, 2, 2, 0]]),
       values=tensor([2, 1, 3, 4]),
       size=(3, 3), nnz=4, layout=torch.sparse_coo)
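
For this case PyTorch also offers the built-in Tensor.to_sparse() method, which produces an equivalent COO representation in one call; a brief sketch using the tensor above:

# Equivalent shortcut: convert the dense tensor directly to COO layout
sparse_tensor2 = tensor.to_sparse()
print(sparse_tensor2)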
3. Convert 3D tensor data to a COO sparse tensor

You can still use the torch.sparse_coo_tensor function to convert it to a sparse representation. As with 2D tensors, you need to determine the positions of the non-zero elements and their values; for a 3D tensor, the coordinates of each non-zero element are described by three values.

import torch

# Example 3D tensor
tensor = torch.tensor([
    [[0, 2, 0], [0, 0, 3], [4, 0, 0]],
    [[0, 0, 0], [0, 5, 0], [0, 0, 6]]
])

print(tensor.dim()) # 3

# Find the indices of the non-zero elements
non_zero_indices = torch.nonzero(tensor).t()
print(non_zero_indices[0])  # tensor([0, 0, 0, 1, 1])
print(non_zero_indices[1])  # tensor([0, 1, 2, 1, 2])
print(non_zero_indices[2])  # tensor([1, 2, 0, 1, 2])


# Get the values of the non-zero elements
values = tensor[tuple(non_zero_indices[i] for i in range(tensor.dim()))]

# Create the sparse tensor
sparse_tensor = torch.sparse_coo_tensor(non_zero_indices, values, tensor.size())

print(sparse_tensor)
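
One detail worth noting, regardless of dimensionality: torch.sparse_coo_tensor does not sort or deduplicate its indices, so the new tensor starts out uncoalesced. Calling coalesce() sorts the indices and sums any duplicate coordinates, which several sparse operations require; a small sketch:

print(sparse_tensor.is_coalesced())  # False for a freshly constructed COO tensor
coalesced = sparse_tensor.coalesce()
print(coalesced.is_coalesced())      # True
print(coalesced.indices())           # .indices()/.values() require a coalesced tensor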
4. Convert tensor data of unknown dimensionality into a COO sparse tensor and store it on disk
"""
	将tensor 数据转换为COO 稀疏张量函数    
"""
def tensor_to_sparse(dense_tensor):
    size = dense_tensor.size()
    # 寻找非零元素的索引
    non_zero_indices = torch.nonzero(dense_tensor).t()
    # 获取非零元素的值
    values = dense_tensor[tuple(non_zero_indices[i] for i in range(dense_tensor.dim()))]
    # 创建稀疏张量
    sparse_tensor = torch.sparse_coo_tensor(non_zero_indices, values, size)

    return sparse_tensor,size
      
# Generate a random 4D tensor
dense_tensor = torch.randn((2,3,3,3))
print(dense_tensor)
print(dense_tensor.dim())   # 4

sparse_tensor,size = tensor_to_sparse(dense_tensor)
print(sparse_tensor)
print(size)    # torch.Size([2, 3, 3, 3])
        
# Save the sparse tensor to disk
torch.save(sparse_tensor, "sparse_tensor.pt")
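
One caveat about this example: torch.randn essentially never produces exact zeros, so the sparse tensor above stores all 54 elements and saves nothing. A small sketch of zeroing out most entries first (the 10% keep-ratio and the demo_ names are purely illustrative), so the COO representation genuinely pays off:

# Keep roughly 10% of the entries; everything else becomes an exact zero
mask = torch.rand_like(dense_tensor) < 0.1
demo_sparse, demo_size = tensor_to_sparse(dense_tensor * mask)
print(demo_sparse)  # nnz is now typically far smaller than the 54 total elements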
5. Read the COO sparse tensor stored on disk and convert it back to the original tensor data (dense_tensor)
# Load the sparse tensor from disk
loaded_sparse_tensor = torch.load("sparse_tensor.pt")


def sparse_to_tensor(loaded_sparse_tensor):
    """Convert a COO sparse tensor back into a dense tensor."""
    # Restore the sparse tensor to the original dense tensor
    dense_tensor = loaded_sparse_tensor.to_dense()

    return dense_tensor


# Call the function
sparse_to_tensor(loaded_sparse_tensor)
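
As a quick sanity check (a sketch that assumes dense_tensor from the previous section is still in scope), the round trip through disk should reproduce the original tensor exactly:

restored_tensor = sparse_to_tensor(loaded_sparse_tensor)
print(torch.equal(restored_tensor, dense_tensor))  # expected: True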

6. Use the scipy package to complete the above operations
import torch
import scipy.sparse



def tensor_to_sparse(dense_tensor):
# Flatten dense_tensor to 2D
    shape = dense_tensor.shape
    tensor_2d = dense_tensor.view(-1, shape[-1])

# Convert the 2D tensor to a numpy array
    array_2d = tensor_2d.numpy()

# Create a sparse matrix from the numpy array
    sparse_matrix = scipy.sparse.coo_matrix(array_2d)

    return sparse_matrix,shape


def sparse_to_tensor(sparse_matrix, original_shape):
# Convert the sparse matrix back to a 2D array
    array_2d = sparse_matrix.toarray()

# Reshape the 2D array back to the original shape
    original_shape_array = array_2d.reshape(original_shape)

# Convert the reshaped array back to a tensor
    dense_tensor = torch.from_numpy(original_shape_array)

    return dense_tensor

# Generate a random dense_tensor
dense_tensor = torch.randn((2,3,3,3))



# Convert it to a sparse matrix
sparse_matrix,original_shape = tensor_to_sparse(dense_tensor)
print(original_shape)


# Save the sparse matrix to disk
scipy.sparse.save_npz('sparse_matrix.npz', sparse_matrix)

# Load the saved sparse matrix from disk with scipy.sparse.load_npz
loaded_sparse_matrix = scipy.sparse.load_npz('sparse_matrix.npz')


# Restore the sparse matrix to the original tensor data
restored_tensor = sparse_to_tensor(loaded_sparse_matrix, original_shape)


print(restored_tensor)

# Check whether restored_tensor matches the original tensor data
print(torch.equal(dense_tensor, restored_tensor))
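
One practical caveat: scipy sparse matrices are strictly two-dimensional, so the original shape is not stored inside the .npz file. To rebuild the tensor in a later session, the shape has to be persisted as well; a minimal sketch (the original_shape.npy file name is just an example):

import numpy as np

# Persist the shape next to the sparse matrix
np.save('original_shape.npy', np.array(original_shape))

# Later, in another process: reload both pieces and rebuild the tensor
loaded_shape = tuple(np.load('original_shape.npy'))
loaded_sparse_matrix = scipy.sparse.load_npz('sparse_matrix.npz')
restored_tensor = sparse_to_tensor(loaded_sparse_matrix, loaded_shape)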

Origin blog.csdn.net/A2000613/article/details/132773495