3D point cloud semantic segmentation - PointNet

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Charles R. Qi* Hao Su* Kaichun Mo Leonidas J. Guibas
Stanford University

With deep learning gradually maturing in 2D image processing, there is naturally the hope of using it to solve classification, recognition, segmentation, completion, and registration problems on 3D point clouds as well. This paper, published in 2017, can be regarded as the originator of applying deep learning directly to scattered, unordered 3D point clouds. The network structure is simple and effective (effective in a relative sense, of course). The authors verified the classification task on ModelNet40, part segmentation on ShapeNet, and semantic segmentation on S3DIS. This article mainly covers semantic segmentation (principle + code).

Look directly at the network structure:

The core idea of the network comes down to two points:

1. Use multi-layer perceptrons (MLPs) directly. For example, the first mlp(64,64) is a two-layer network with 64 and 64 neurons respectively. Why not use convolution? The author explains it like this: to share weights and make optimization tractable, the traditional convolution framework needs input data in a very regular form, but a point cloud is unordered; it generally has to be converted into a voxel grid or pushed through a series of normalization steps first, which introduces a lot of human intervention and may damage the data.

2. Max pooling (MaxPooling). Taking the disorder of point clouds, their symmetry, and invariance to spatial transformations fully into account, max pooling is proposed as the aggregation step. Specifically, the paper lists three strategies for handling unordered input like point clouds: 1) sort the points directly; 2) treat the point cloud as a sequence fed to an RNN, with data augmentation that permutes the point order; 3) aggregate the information of every point with a symmetric function, such as max pooling (symmetric meaning that whatever the input order, the result stays the same). Comparative experiments show that max pooling works best.
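
A quick way to see why a symmetric function like max pooling gives order invariance: shuffling the rows (points) of a feature matrix does not change its column-wise maximum. A minimal illustrative PyTorch snippet (not part of the repo):

import torch

torch.manual_seed(0)
feats = torch.rand(5, 1024)              # 5 points, each with a 1024-dim feature
perm = torch.randperm(5)                 # a random re-ordering of the points
shuffled = feats[perm]

global_a = feats.max(dim=0)[0]           # max over the point dimension
global_b = shuffled.max(dim=0)[0]
print(torch.equal(global_a, global_b))   # True: the pooled feature ignores point order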

Network process:

Input: the whole network takes a fixed number n of points as input, each point carrying its xyz three-dimensional coordinates.

input transform: this is the small module in the lower-left corner of the network structure diagram, called the Joint Alignment Network in the original paper. The labels we attach to a point cloud do not change under spatial transformations, so we want what the network learns to be invariant to spatial transformations as well. The purpose of this module is therefore to estimate an affine transformation matrix (3x3) and multiply the input point cloud by it before feature extraction (i.e. apply an affine transformation to the input point cloud). The T-Net inside the module is itself built from MLPs; its exact structure is covered in the code analysis section.

mlp(64, 64): a two-layer perceptron with 64 and 64 neurons per layer. "Shared" means the network structure is fixed and every point is passed through this same structure to get its output (like the proverbial fixed barracks with soldiers streaming through).

feature transform: the same idea as the input transform, but in a higher-dimensional feature space, so the estimated affine transformation is also higher-dimensional (64x64). Because the feature space has many more dimensions, optimization is harder, so a regularization term is added to the loss that pushes the solved affine transformation matrix towards an orthogonal matrix, L_reg = ||I - A·Aᵀ||²_F (the paper argues that an orthogonal transformation does not lose input information). Here A is the affine transformation matrix estimated by T-Net.

mlp (64, 128, 1024): three-layer perceptron, the number of neurons in each layer is 64, 128, 1024 respectively.

maxpooling: at this point each of the n input points has a 1024-dimensional feature; taking the maximum over the n points (i.e. along the point dimension) yields one global feature. For the classification task this feature goes straight into the next mlp(512,256,k), where k is the number of categories, producing a score per category. For the segmentation task the global feature is copied n times and concatenated with the per-point features from the earlier mlp, forming the input of the Segmentation Network.

Segmentation Network: its input is thus the concatenation of each point's local feature with the global feature; after two more MLPs it outputs an n x m array, i.e. an m-way classification for every point, giving a predicted score per category.
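
To make the tensor shapes concrete, here is a small illustrative sketch (not repo code) of how the 1024-dim global feature is broadcast back to every point and concatenated with the 64-dim local features before the per-point classifier:

import torch

B, N, m = 2, 4096, 13                  # batch size, points per sample, number of classes
pointfeat = torch.rand(B, 64, N)       # per-point local features from the earlier mlp
global_feat = torch.rand(B, 1024)      # global feature after max pooling

expanded = global_feat.unsqueeze(-1).repeat(1, 1, N)  # B x 1024 x N: one copy per point
seg_input = torch.cat([expanded, pointfeat], dim=1)   # B x 1088 x N
print(seg_input.shape)                                # torch.Size([2, 1088, 4096])
# the segmentation head then maps 1088 -> 512 -> 256 -> 128 -> m scores per point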

Semantic Segmentation Experiment      
This article mainly walks through the semantic segmentation experiment. The code comes from GitHub (a fork of an existing project) and can be accessed here. It is implemented in PyTorch and also covers the other tasks as well as PointNet++; just clone the whole project. The dataset is S3DIS, which can be downloaded here (a VPN may be needed, and you have to fill in some contact information before the Google download link becomes available).

The downloaded archive is named Stanford3dDataset_v1.2_Aligned_Version. After downloading, unzip it to data/s3dis/Stanford3dDataset_v1.2_Aligned_Version/

A brief introduction to the dataset: it contains 6 indoor areas scanned with Matterport scanners, 271 rooms in total. Every scanned point is semantically annotated with one of 13 categories: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, clutter. In fact, the official description only lists the first 12 categories; stairs and other odds and ends found in some rooms are classified as clutter.

After decompressing the archive, each room's data is a txt file in which every line is one point with six values (XYZRGB). The Annotations folder contains one txt file per object in that room, also with XYZRGB data. You can open a txt file in a text editor, prepend "v " (v plus a space) to every line in column-editing mode, change the extension to .obj (or write a small script yourself; the code is at the end of the article), and then open the obj file directly with meshlab:

You can "walk" in and see, this is a conference room with tables, chairs, blackboard, etc., and of course the ceiling and floor:

After viewing the raw data, and since the cloned project already ships a trained model (saved at log\sem_seg\pointnet_sem_seg\checkpoints\best_model.pth), we can run the test directly. Before testing, the raw dataset needs some processing so that data and labels end up in one file, where every line is XYZRGBL (L is the label: 0~12). Concretely: enter the data_utils folder and run collect_indoor3d_data.py; it creates a new folder data\stanford_indoor3d\, generates a pile of .npy files (numpy format) under the raw data folder, and those .npy files are then cut over to stanford_indoor3d\ (in short, the generated .npy files must end up in that folder). After that the test code can be run.
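
To sanity-check the preprocessing, you can load one of the generated .npy files and inspect it (the file name below is just an example; adjust the path to wherever your .npy files ended up):

import numpy as np

room = np.load('data/s3dis/stanford_indoor3d/Area_1_WC_1.npy')  # example file
print(room.shape)                                        # (N, 7): X Y Z R G B L
print(room[:, :3].min(axis=0), room[:, :3].max(axis=0))  # spatial extent of the room
print(np.unique(room[:, 6]))                             # which label indices (0~12) occur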

Run the test file:

python test_semseg.py --log_dir pointnet_sem_seg --test_area 5 --visual

The test area is Area 5. The generated obj files (both ground truth and prediction) are saved in log/sem_seg/pointnet_sem_seg/visual/ and can be opened directly with meshlab. The test results are:

avg class IoU: 0.436354

avg class acc: 0.526488

whole scene point accuracy: 0.787821

Opening one of the label files and the corresponding prediction result with meshlab:

Detailed code:

Debug the code in the order of model files, data processing files, training files, and test files.

Model files: models\pointnet_utils.py, models\pointnet_sem_seg.py

In pointnet_utils, the STN3d and STNkd modules are defined first; these are the two small networks in the structure diagram:

Taking STN3d as an example: it is actually similar in structure to the backbone network, implemented as a series of mlps (using 1x1 one-dimensional convolutions, where the convolution kernel plays the role of the weights between neurons and the data itself are the neuron nodes) followed by fully connected layers. The module takes the xyz coordinates as input and outputs a 3*3 (or, for STNkd, k*k) affine transformation matrix.

class STN3d(nn.Module):
    def __init__(self, channel):
        super(STN3d, self).__init__()
        self.conv1 = torch.nn.Conv1d(channel, 64, 1)#in_channels,out_channels,kernel_size
        self.conv2 = torch.nn.Conv1d(64, 128, 1)
        self.conv3 = torch.nn.Conv1d(128, 1024, 1)
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 9)#9=3x3
        self.relu = nn.ReLU()

        self.bn1 = nn.BatchNorm1d(64)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(1024)
        self.bn4 = nn.BatchNorm1d(512)
        self.bn5 = nn.BatchNorm1d(256)

    def forward(self, x):
        batchsize = x.size()[0]  # input dims: Batchsize x Channel x N-points
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = torch.max(x, 2, keepdim=True)[0]  # max over the N (points) dimension
        x = x.view(-1, 1024)

        x = F.relu(self.bn4(self.fc1(x)))
        x = F.relu(self.bn5(self.fc2(x)))
        x = self.fc3(x)

        iden = Variable(torch.from_numpy(np.array([1, 0, 0, 0, 1, 0, 0, 0, 1]).astype(np.float32))).view(1, 9).repeat(
            batchsize, 1)  # flattened 3x3 identity matrix, one copy per batch sample (Variable is the legacy autograd wrapper)
        if x.is_cuda:
            iden = iden.cuda()
        x = x + iden  # the predicted matrix is an offset added to the identity matrix
        x = x.view(-1, 3, 3)  # one estimated 3*3 affine transformation matrix per batch sample
        return x

The structure of STNkd is essentially the same, except that the final estimated affine transformation matrix is k*k (default: k=64).
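
For reference, a condensed sketch of STNkd (written here for illustration and assuming the same imports as pointnet_utils.py, not copied verbatim from the repo) shows the two differences: the first conv takes k input channels, and the final fully connected layer predicts k*k values that are reshaped into a k*k matrix:

class STNkd(nn.Module):
    def __init__(self, k=64):
        super(STNkd, self).__init__()
        self.conv1 = torch.nn.Conv1d(k, 64, 1)   # first conv takes k-dim point features
        self.conv2 = torch.nn.Conv1d(64, 128, 1)
        self.conv3 = torch.nn.Conv1d(128, 1024, 1)
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, k * k)         # k*k entries instead of 9
        self.bn1, self.bn2, self.bn3 = nn.BatchNorm1d(64), nn.BatchNorm1d(128), nn.BatchNorm1d(1024)
        self.bn4, self.bn5 = nn.BatchNorm1d(512), nn.BatchNorm1d(256)
        self.k = k

    def forward(self, x):
        batchsize = x.size()[0]
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = torch.max(x, 2, keepdim=True)[0].view(-1, 1024)
        x = F.relu(self.bn4(self.fc1(x)))
        x = F.relu(self.bn5(self.fc2(x)))
        x = self.fc3(x)
        iden = torch.eye(self.k).flatten().view(1, self.k * self.k).repeat(batchsize, 1)
        if x.is_cuda:
            iden = iden.cuda()
        x = (x + iden).view(-1, self.k, self.k)  # offset from the identity, reshaped to k*k
        return x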

Next, PointNetEncoder is defined. In this part of the structure diagram the input is n D-dimensional points, where each point may carry extra dimensions beyond xyz. Internally it calls the STN3d and STNkd modules, with a switch deciding whether STNkd is used. There are three outputs. Output 1 depends on the task: for classification it is a 1*1024 global feature; for segmentation the global feature is copied n times and concatenated with the per-point features from the earlier mlp, and that concatenation is returned as the input of the segmentation network. Output 2 is the 3*3 spatial transformation matrix estimated by STN3d, and output 3 is the k*k feature transformation matrix estimated by STNkd (None when the switch is off).

class PointNetEncoder(nn.Module):
    def __init__(self, global_feat=True, feature_transform=False, channel=3):
        '''global_feat=True means classification; it is False for segmentation
           feature_transform=True means the feature space should also be aligned,
           i.e. the on/off switch for the STNkd module
        '''
        super(PointNetEncoder, self).__init__()
        self.stn = STN3d(channel)
        self.conv1 = torch.nn.Conv1d(channel, 64, 1)
        self.conv2 = torch.nn.Conv1d(64, 128, 1)
        self.conv3 = torch.nn.Conv1d(128, 1024, 1)
        self.bn1 = nn.BatchNorm1d(64)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(1024)
        self.global_feat = global_feat
        self.feature_transform = feature_transform
        if self.feature_transform:
            self.fstn = STNkd(k=64)

    def forward(self, x):
        B, D, N = x.size()  # B=batch, D=channel (or "dimension"), N=number of points
        trans = self.stn(x)
        x = x.transpose(2, 1)  # B C N -> B N C: move the channel dimension to the end
        if D > 3:  # if the input has more than three dims, only the first three (XYZ) are transformed by the stn
            feature = x[:, :, 3:]  # the extra features start at the 4th dimension
            x = x[:, :, :3]
        x = torch.bmm(x, trans)  # the stn estimates a spatial transformation, which is applied to x here
        if D > 3:
            x = torch.cat([x, feature], dim=2)  # stitch the extra features back on after the transform
        x = x.transpose(2, 1)  # swap the channels back to B C N
        x = F.relu(self.bn1(self.conv1(x)))

        if self.feature_transform:  # if the STNkd switch is on:
            trans_feat = self.fstn(x)
            x = x.transpose(2, 1)  # again move the channel dimension to the end
            x = torch.bmm(x, trans_feat)  # apply the transformation estimated for the feature space
            x = x.transpose(2, 1)  # and swap back
        else:  # otherwise no feature transform is performed
            trans_feat = None

        pointfeat = x  # per-point local features at this stage, kept for concatenation with the global feature
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.bn3(self.conv3(x))
        x = torch.max(x, 2, keepdim=True)[0]
        x = x.view(-1, 1024)  # the global_feature after maxpooling in the structure diagram
        if self.global_feat:  # classification: return x (the global feature) directly
            return x, trans, trans_feat  # trans: transform estimated for 3D space; trans_feat: transform estimated for the feature space
        else:  # segmentation: repeat x N times along the point dimension (then concatenate it with the per-point features saved above, forming the segmentation network input)
            x = x.view(-1, 1024, 1).repeat(1, 1, N)
            return torch.cat([x, pointfeat], 1), trans, trans_feat

At the end of pointnet_utils.py a function is defined: feature_transform_reguliarzer. When aligning the feature space as described in the network-process section above, a regularization constraint is added that keeps A·Aᵀ close to the identity matrix, i.e. keeps the solved matrix close to orthogonal, where A is the estimated k*k matrix (the function's input). The function returns this regularization term, which is used later when computing the loss.

def feature_transform_reguliarzer(trans):  # when aligning the feature space, the higher dimensionality makes optimization hard, so this regularization term pushes the estimated transformation matrix towards an orthogonal one
    d = trans.size()[1]  # number of channels
    I = torch.eye(d)[None, :, :]  # add a batch dimension
    if trans.is_cuda:
        I = I.cuda()
    loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2, 1)) - I, dim=(1, 2)))
    return loss
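
As a quick sanity check of what this term measures (an illustrative snippet, assuming the function above is in scope): for orthogonal matrices A we have A·Aᵀ = I, so the regularizer is essentially zero, while random matrices are penalized:

import torch

ortho = torch.eye(64).unsqueeze(0).repeat(4, 1, 1)  # a batch of 4 identity (orthogonal) matrices
print(feature_transform_reguliarzer(ortho))         # tensor(0.) or very close to it

rand = torch.rand(4, 64, 64)                        # random matrices are far from orthogonal
print(feature_transform_reguliarzer(rand))          # a clearly positive value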

Next, look at pointnet_sem_seg.py, which describes the whole network used for the segmentation task. It imports the two modules PointNetEncoder and feature_transform_reguliarzer defined in the previous file and defines two classes, get_model and get_loss.

class get_model(nn.Module):
    def __init__(self, num_class):
        super(get_model, self).__init__()
        self.k = num_class
        self.feat = PointNetEncoder(global_feat=False, feature_transform=True, channel=9)  # 9-D input: xyz, rgb, and the point's relative position in the room (normalized to 0~1); global_feat=False means segmentation
        self.conv1 = torch.nn.Conv1d(1088, 512, 1)
        self.conv2 = torch.nn.Conv1d(512, 256, 1)
        self.conv3 = torch.nn.Conv1d(256, 128, 1)
        self.conv4 = torch.nn.Conv1d(128, self.k, 1)
        self.bn1 = nn.BatchNorm1d(512)
        self.bn2 = nn.BatchNorm1d(256)
        self.bn3 = nn.BatchNorm1d(128)

    def forward(self, x):
        batchsize = x.size()[0]
        n_pts = x.size()[2] #B C N
        x, trans, trans_feat = self.feat(x)  # trans: 3*3; trans_feat: 64*64
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.conv4(x)  # per-point class scores, n*m
        x = x.transpose(2,1).contiguous()  # contiguous() copies the data so the transposed tensor gets its own contiguous memory
        x = F.log_softmax(x.view(-1,self.k), dim=-1)  # mathematically equivalent to log(softmax(x)), but doing the two operations separately is slower and numerically unstable; this fused version computes the output and gradient correctly
        x = x.view(batchsize, n_pts, self.k)  # n*m: every point gets a score for each of the m classes
        return x, trans_feat  # trans_feat is returned so the regularization constraint mentioned above (the matrix A in the formula) can be applied when computing the loss

class get_loss(torch.nn.Module):
    def __init__(self, mat_diff_loss_scale=0.001):  # factor controlling the strength of the regularization
        super(get_loss, self).__init__()
        self.mat_diff_loss_scale = mat_diff_loss_scale

    def forward(self, pred, target, trans_feat, weight):
        loss = F.nll_loss(pred, target, weight = weight)  # negative log likelihood loss: NLLLoss takes a vector of log-probabilities plus a target label and does not compute the log-probabilities itself, so it matches a network whose last layer is log_softmax; nn.CrossEntropyLoss() is the same as NLLLoss() except that it applies the softmax for us
                                                          # in essence: CrossEntropyLoss() = log_softmax() + NLLLoss()
        mat_diff_loss = feature_transform_reguliarzer(trans_feat)
        total_loss = loss + mat_diff_loss * self.mat_diff_loss_scale  # regularization constraint from the feature alignment
        return total_loss

The __main__ function is written in this file, so you can run it directly to view the model summary

if __name__ == '__main__':
    model = get_model(13)
    xyz = torch.rand(12, 9, 2048)  # B C N; the model above expects 9 input channels, so the original 3 has been changed to 9 here
    (model(xyz))
    print(model)

Output:

get_model(
  (feat): PointNetEncoder(
    (stn): STN3d(
      (conv1): Conv1d(9, 64, kernel_size=(1,), stride=(1,))
      (conv2): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
      (conv3): Conv1d(128, 1024, kernel_size=(1,), stride=(1,))
      (fc1): Linear(in_features=1024, out_features=512, bias=True)
      (fc2): Linear(in_features=512, out_features=256, bias=True)
      (fc3): Linear(in_features=256, out_features=9, bias=True)
      (relu): ReLU()
      (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn3): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (conv1): Conv1d(9, 64, kernel_size=(1,), stride=(1,))
    (conv2): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
    (conv3): Conv1d(128, 1024, kernel_size=(1,), stride=(1,))
    (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (bn3): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (fstn): STNkd(
      (conv1): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
      (conv2): Conv1d(64, 128, kernel_size=(1,), stride=(1,))
      (conv3): Conv1d(128, 1024, kernel_size=(1,), stride=(1,))
      (fc1): Linear(in_features=1024, out_features=512, bias=True)
      (fc2): Linear(in_features=512, out_features=256, bias=True)
      (fc3): Linear(in_features=256, out_features=4096, bias=True)
      (relu): ReLU()
      (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn3): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (bn5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (conv1): Conv1d(1088, 512, kernel_size=(1,), stride=(1,))
  (conv2): Conv1d(512, 256, kernel_size=(1,), stride=(1,))
  (conv3): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
  (conv4): Conv1d(128, 13, kernel_size=(1,), stride=(1,))
  (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

Data processing files: data_utils\S3DISDataLoader.py, data_utils\indoor3d_util.py, data_utils\collect_indoor3d_data.py

The two files indoor3d_util.py and collect_indoor3d_data.py process the downloaded raw data and merge labels and data into .npy files, where each row is XYZRGB plus a label index. indoor3d_util.py defines a lot of utilities, somewhat like a toolkit, together with some file paths; collect_indoor3d_data.py uses its collect_point_label tool. Running collect_indoor3d_data.py generates the .npy (numpy) files required for the experiment in data/stanford_indoor3d.

The focus is the S3DISDataLoader.py file. First the loader S3DISDataset is defined, inheriting from torch.utils.data.Dataset; a custom dataset in PyTorch must implement at least the __getitem__ and __len__ methods. The parameters passed to __init__ include, among others, the test area (default 5) and the block size (during training, 4096 points are randomly sampled from a 1m*1m block as the network input).

class S3DISDataset(Dataset):
    def __init__(self, split='train', data_root='trainval_fullarea', num_point=4096, test_area=5, block_size=1.0, sample_rate=1.0, transform=None):
        super().__init__()
        self.num_point = num_point
        self.block_size = block_size
        self.transform = transform
        rooms = sorted(os.listdir(data_root))
        rooms = [room for room in rooms if 'Area_' in room]  # 271 .npy files
        if split == 'train':
            rooms_split = [room for room in rooms if not 'Area_{}'.format(test_area) in room]  # every area except the test area (5)
        else:
            rooms_split = [room for room in rooms if 'Area_{}'.format(test_area) in room]  # the test area (5) only

        self.room_points, self.room_labels = [], []
        self.room_coord_min, self.room_coord_max = [], []
        num_point_all = []
        labelweights = np.zeros(13)  # initialize the per-class weights: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

        for room_name in tqdm(rooms_split, total=len(rooms_split)):  # tqdm wraps an iterable to show a progress bar; e.g. room_name: 'Area_1_WC_1.npy'
            room_path = os.path.join(data_root, room_name)  # e.g. 'data/s3dis/stanford_indoor3d/Area_1_WC_1.npy'
            room_data = np.load(room_path)  # xyzrgbl, N*7
            points, labels = room_data[:, 0:6], room_data[:, 6]  # xyzrgb, N*6; l, N
            tmp, _ = np.histogram(labels, range(14))  # histogram over the current room's labels: point count per class, e.g. [192039 185764 488740 0 0 0 28008 0 0 0 0 0 218382]
            labelweights += tmp
            coord_min, coord_max = np.amin(points, axis=0)[:3], np.amax(points, axis=0)[:3]  # min and max over x, y and z separately
            self.room_points.append(points), self.room_labels.append(labels)
            self.room_coord_min.append(coord_min), self.room_coord_max.append(coord_max)
            num_point_all.append(labels.size)  # total number of points in each room, e.g. a list of 204 rooms during training
        labelweights = labelweights.astype(np.float32)
        labelweights = labelweights / np.sum(labelweights)
        self.labelweights = np.power(np.amax(labelweights) / labelweights, 1 / 3.0)  # (max class frequency / class frequency)^(1/3)
        print(self.labelweights)  # per-class weights over all training data, e.g. [1.124833 1.1816078 1. 2.2412012 2.340336 2.343587 1.7070498 2.0335796 1.8852289 3.8252103 1.7948895 2.7857335 1.3452303]
        sample_prob = num_point_all / np.sum(num_point_all)  # sampling probability per room (list of 204): room point count / total point count
        num_iter = int(np.sum(num_point_all) * sample_rate / num_point)  # number of iterations: total points * sample_rate (1 = sample everything) / 4096; this many draws are needed to cover all rooms
        room_idxs = []
        for index in range(len(rooms_split)):
            room_idxs.extend([index] * int(round(sample_prob[index] * num_iter)))  # keep appending room indices 0~203; each index is repeated num_iter * that room's sampling probability times, i.e. the number of times that room will be sampled
        self.room_idxs = np.array(room_idxs)  # every sample picks its room according to room_idxs: proportional allocation rather than uniform or purely random choice
        print("Totally {} samples in {} set.".format(len(self.room_idxs), split))  # number of samples

    def __getitem__(self, idx):
        room_idx = self.room_idxs[idx]
        points = self.room_points[room_idx]   # all points of the selected room, N * 6
        labels = self.room_labels[room_idx]   # labels of all points of the selected room, N
        N_points = points.shape[0]

        while (True):
            center = points[np.random.choice(N_points)][:3]  # randomly pick one point as the block center
            block_min = center - [self.block_size / 2.0, self.block_size / 2.0, 0]  # block extent: minimum of the three coordinates
            block_max = center + [self.block_size / 2.0, self.block_size / 2.0, 0]  # block extent: maximum of the three coordinates
            point_idxs = np.where((points[:, 0] >= block_min[0]) & (points[:, 0] <= block_max[0]) & (points[:, 1] >= block_min[1]) & (points[:, 1] <= block_max[1]))[0]  # indices of the points inside the block
            if point_idxs.size > 1024:  # keep drawing random blocks until one contains more than 1024 points
                break

        if point_idxs.size >= self.num_point:  # if the block has more than 4096 points, subsample 4096; if fewer, duplicate some points to fill up to 4096
            selected_point_idxs = np.random.choice(point_idxs, self.num_point, replace=False)  # randomly pick 4096 of them
        else:
            selected_point_idxs = np.random.choice(point_idxs, self.num_point, replace=True)  # replace=True allows duplicate indices, replace=False does not

        # normalize
        selected_points = points[selected_point_idxs, :]  # num_point * 6
        current_points = np.zeros((self.num_point, 9))  # num_point * 9
        current_points[:, 6] = selected_points[:, 0] / self.room_coord_max[room_idx][0]  # the paper's 9-D input: the last three dims are the point's relative position in the room
        current_points[:, 7] = selected_points[:, 1] / self.room_coord_max[room_idx][1]
        current_points[:, 8] = selected_points[:, 2] / self.room_coord_max[room_idx][2]
        selected_points[:, 0] = selected_points[:, 0] - center[0]  # center x and y on the block center (z is left unchanged)
        selected_points[:, 1] = selected_points[:, 1] - center[1]
        selected_points[:, 3:6] /= 255.0  # normalize RGB
        current_points[:, 0:6] = selected_points
        current_labels = labels[selected_point_idxs]
        if self.transform is not None:
            current_points, current_labels = self.transform(current_points, current_labels)
        return current_points, current_labels  # return 4096 points and their labels

    def __len__(self):
        return len(self.room_idxs)

This file also defines ScannetDatasetWholeScene(), which we will come back to when explaining the test file.

Training file: train_semseg.py. It uses rotate_point_cloud_z() from provider.py for data augmentation (random rotation around the z-axis). The rest of the code is very routine: parse user arguments, create folders, set hyperparameters and their scheduling, build the dataloaders, load the model, then train, evaluate, save the model (both the latest model.pth and the model with the highest IoU, best_model.pth), and print logging information.
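
The per-batch training step itself is standard; a simplified sketch of its core (variable names are illustrative and details are trimmed, so this is not the exact file contents):

for points, target in trainDataLoader:                    # points: B x 4096 x 9, target: B x 4096
    points = points.data.numpy()
    points[:, :, :3] = provider.rotate_point_cloud_z(points[:, :, :3])  # random rotation around z
    points = torch.Tensor(points).float().cuda().transpose(2, 1)        # B x 9 x 4096
    target = target.long().cuda().view(-1)                              # flatten to B*4096 labels

    optimizer.zero_grad()
    seg_pred, trans_feat = classifier(points)              # B x 4096 x 13 log-probabilities
    seg_pred = seg_pred.contiguous().view(-1, 13)
    loss = criterion(seg_pred, target, trans_feat, weights)  # get_loss: NLL loss (weighted by the dataset's labelweights) + feature-transform regularizer
    loss.backward()
    optimizer.step()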

Test file: test_semseg.py

Since testing does not randomly sample blocks the way training does, but needs the whole scene fed through the network, ScannetDatasetWholeScene() defined in S3DISDataLoader.py is used to prepare the data. Concretely, a room is divided into a grid with a given stride, and overlapping sliding blocks sample the points; as in training, if a block contains fewer than 4096 points, some points are resampled. Each block therefore generally yields several small batches; each batch is pushed through the network, the prediction scores are accumulated, the IoU is computed at the end, and finally each point's predicted class is mapped to its label color and written to file together with the coordinates.

S3DISDataLoader.py:

class ScannetDatasetWholeScene():
    # prepare to give prediction on each points
    def __init__(self, root, block_points=4096, split='test', test_area=5, stride=0.5, block_size=1.0, padding=0.001):
        self.block_points = block_points
        self.block_size = block_size
        self.padding = padding
        self.root = root
        self.split = split
        self.stride = stride
        self.scene_points_num = []
        assert split in ['train', 'test']
        if self.split == 'train':
            self.file_list = [d for d in os.listdir(root) if d.find('Area_%d' % test_area) == -1]
        else:
            self.file_list = [d for d in os.listdir(root) if d.find('Area_%d' % test_area) != -1]
        self.scene_points_list = []
        self.semantic_labels_list = []
        self.room_coord_min, self.room_coord_max = [], []
        for file in self.file_list:
            data = np.load(root + file)  # loads all points of one room
            points = data[:, :3]
            self.scene_points_list.append(data[:, :6])
            self.semantic_labels_list.append(data[:, 6])
            coord_min, coord_max = np.amin(points, axis=0)[:3], np.amax(points, axis=0)[:3]
            self.room_coord_min.append(coord_min), self.room_coord_max.append(coord_max)
        assert len(self.scene_points_list) == len(self.semantic_labels_list)

        labelweights = np.zeros(13)
        for seg in self.semantic_labels_list:
            tmp, _ = np.histogram(seg, range(14))
            self.scene_points_num.append(seg.shape[0])
            labelweights += tmp
        labelweights = labelweights.astype(np.float32)
        labelweights = labelweights / np.sum(labelweights)
        self.labelweights = np.power(np.amax(labelweights) / labelweights, 1 / 3.0)

    def __getitem__(self, index):
        point_set_ini = self.scene_points_list[index]  # take all points of one room
        points = point_set_ini[:,:6]
        labels = self.semantic_labels_list[index]
        coord_min, coord_max = np.amin(points, axis=0)[:3], np.amax(points, axis=0)[:3]
        grid_x = int(np.ceil(float(coord_max[0] - coord_min[0] - self.block_size) / self.stride) + 1)  # divide the room into a grid
        grid_y = int(np.ceil(float(coord_max[1] - coord_min[1] - self.block_size) / self.stride) + 1)
        data_room, label_room, sample_weight, index_room = np.array([]), np.array([]), np.array([]),  np.array([])
        for index_y in range(0, grid_y):
            for index_x in range(0, grid_x):
                s_x = coord_min[0] + index_x * self.stride  # block start along x
                e_x = min(s_x + self.block_size, coord_max[0])  # block end along x; clamp to the room boundary
                s_x = e_x - self.block_size  # re-align the start after clamping, ready for the next stride
                s_y = coord_min[1] + index_y * self.stride
                e_y = min(s_y + self.block_size, coord_max[1])
                s_y = e_y - self.block_size
                point_idxs = np.where(
                    (points[:, 0] >= s_x - self.padding) & (points[:, 0] <= e_x + self.padding) & (points[:, 1] >= s_y - self.padding) & (
                                points[:, 1] <= e_y + self.padding))[0]  # a block can actually contain quite a lot of points, e.g. 69678
                if point_idxs.size == 0:  # if this block happens to contain no points at all, move on
                    continue
                num_batch = int(np.ceil(point_idxs.size / self.block_points))  # e.g. 18
                point_size = int(num_batch * self.block_points)  # rounds the point count up to a multiple of 4096, e.g. 73728
                replace = False if (point_size - point_idxs.size <= point_idxs.size) else True  # point_size - point_idxs.size points must be added to reach the multiple of 4096; sampling with replacement is only needed when the block has fewer points than that shortfall (in practice only below 2048 points)
                point_idxs_repeat = np.random.choice(point_idxs, point_size - point_idxs.size, replace=replace)  # randomly sample the points needed for rounding up
                point_idxs = np.concatenate((point_idxs, point_idxs_repeat))  # e.g. 73728
                np.random.shuffle(point_idxs)
                data_batch = points[point_idxs, :]  # 73728*6
                normlized_xyz = np.zeros((point_size, 3))
                # the point's relative position inside the room (between 0 and 1), for x, y, z
                normlized_xyz[:, 0] = data_batch[:, 0] / coord_max[0]
                normlized_xyz[:, 1] = data_batch[:, 1] / coord_max[1]
                normlized_xyz[:, 2] = data_batch[:, 2] / coord_max[2]
                data_batch[:, 0] = data_batch[:, 0] - (s_x + self.block_size / 2.0)  # shift to coordinates local to the block
                data_batch[:, 1] = data_batch[:, 1] - (s_y + self.block_size / 2.0)
                data_batch[:, 3:6] /= 255.0
                data_batch = np.concatenate((data_batch, normlized_xyz), axis=1)  # XYZRGB + normalized xyz
                label_batch = labels[point_idxs].astype(int)
                batch_weight = self.labelweights[label_batch]

                data_room = np.vstack([data_room, data_batch]) if data_room.size else data_batch  # stack along the batch dimension
                label_room = np.hstack([label_room, label_batch]) if label_room.size else label_batch
                sample_weight = np.hstack([sample_weight, batch_weight]) if label_room.size else batch_weight
                index_room = np.hstack([index_room, point_idxs]) if index_room.size else point_idxs  # index of each block point within the whole room
        data_room = data_room.reshape((-1, self.block_points, data_room.shape[1]))  # after the grid loop this holds the points of every visited block; each block contributes a multiple of 4096 points, i.e. that many mini-batches
        label_room = label_room.reshape((-1, self.block_points))
        sample_weight = sample_weight.reshape((-1, self.block_points))
        index_room = index_room.reshape((-1, self.block_points))
        return data_room, label_room, sample_weight, index_room

    def __len__(self):
        return len(self.scene_points_list)  # number of rooms (67)

test_semseg.py:

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
ROOT_DIR = BASE_DIR
sys.path.append(os.path.join(ROOT_DIR, 'models'))

classes = ['ceiling', 'floor', 'wall', 'beam', 'column', 'window', 'door', 'table', 'chair', 'sofa', 'bookcase',
           'board', 'clutter']
class2label = {cls: i for i, cls in enumerate(classes)}
seg_classes = class2label  # {class name: index}
seg_label_to_cat = {}  # {index: class name}
for i, cat in enumerate(seg_classes.keys()):
    seg_label_to_cat[i] = cat


def parse_args():
    '''PARAMETERS'''
    parser = argparse.ArgumentParser('Model')
    parser.add_argument('--batch_size', type=int, default=1, help='batch size in testing [default: 32]')
    parser.add_argument('--gpu', type=str, default='0', help='specify gpu device')
    parser.add_argument('--num_point', type=int, default=4096, help='point number [default: 4096]')
    parser.add_argument('--log_dir', type=str, required=True, help='experiment root')
    parser.add_argument('--visual', action='store_true', default=False, help='visualize result [default: False]')
    parser.add_argument('--test_area', type=int, default=5, help='area for testing, option: 1-6 [default: 5]')
    parser.add_argument('--num_votes', type=int, default=3, help='aggregate segmentation scores with voting [default: 5]')
    return parser.parse_args()


def add_vote(vote_label_pool, point_idx, pred_label, weight):
    B = pred_label.shape[0]
    N = pred_label.shape[1]  # 4096, one mini-batch inside a block
    for b in range(B):
        for n in range(N):  # iterate over every point
            if weight[b, n] != 0 and not np.isinf(weight[b, n]):
                vote_label_pool[int(point_idx[b, n]), int(pred_label[b, n])] += 1  # accumulate each point's votes over the 13 classes
    return vote_label_pool


def main(args):
    def log_string(str):
        logger.info(str)
        print(str)

    '''HYPER PARAMETER'''
    os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu
    experiment_dir = 'log/sem_seg/' + args.log_dir
    visual_dir = experiment_dir + '/visual/'
    visual_dir = Path(visual_dir)
    visual_dir.mkdir(exist_ok=True)

    '''LOG'''
    args = parse_args()
    logger = logging.getLogger("Model")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    file_handler = logging.FileHandler('%s/eval.txt' % experiment_dir)
    file_handler.setLevel(logging.INFO)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    log_string('PARAMETER ...')
    log_string(args)

    NUM_CLASSES = 13
    BATCH_SIZE = args.batch_size
    NUM_POINT = args.num_point

    root = 'data/s3dis/stanford_indoor3d/'

    TEST_DATASET_WHOLE_SCENE = ScannetDatasetWholeScene(root, split='test', test_area=args.test_area, block_points=NUM_POINT)
    log_string("The number of test data is: %d" % len(TEST_DATASET_WHOLE_SCENE))

    '''MODEL LOADING'''  # load the model and the pretrained checkpoint
    model_name = os.listdir(experiment_dir + '/logs')[0].split('.')[0]
    MODEL = importlib.import_module(model_name)
    classifier = MODEL.get_model(NUM_CLASSES).cuda()
    checkpoint = torch.load(str(experiment_dir) + '/checkpoints/best_model.pth')
    classifier.load_state_dict(checkpoint['model_state_dict'])
    classifier = classifier.eval()

    with torch.no_grad():
        scene_id = TEST_DATASET_WHOLE_SCENE.file_list  # .npy files of the 67 test rooms
        scene_id = [x[:-4] for x in scene_id]
        num_batches = len(TEST_DATASET_WHOLE_SCENE)  # 67 rooms

        total_seen_class = [0 for _ in range(NUM_CLASSES)]  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        total_correct_class = [0 for _ in range(NUM_CLASSES)]
        total_iou_deno_class = [0 for _ in range(NUM_CLASSES)]

        log_string('---- EVALUATION WHOLE SCENE----')

        for batch_idx in range(num_batches):
            print("Inference [%d/%d] %s ..." % (batch_idx + 1, num_batches, scene_id[batch_idx]))
            total_seen_class_tmp = [0 for _ in range(NUM_CLASSES)]
            total_correct_class_tmp = [0 for _ in range(NUM_CLASSES)]
            total_iou_deno_class_tmp = [0 for _ in range(NUM_CLASSES)]
            if args.visual:
                fout = open(os.path.join(visual_dir, scene_id[batch_idx] + '_pred.obj'), 'w')
                fout_gt = open(os.path.join(visual_dir, scene_id[batch_idx] + '_gt.obj'), 'w')

            whole_scene_data = TEST_DATASET_WHOLE_SCENE.scene_points_list[batch_idx]
            whole_scene_label = TEST_DATASET_WHOLE_SCENE.semantic_labels_list[batch_idx]
            vote_label_pool = np.zeros((whole_scene_label.shape[0], NUM_CLASSES))  # N*13: every point gets 13 scores
            for _ in tqdm(range(args.num_votes), total=args.num_votes):
                scene_data, scene_label, scene_smpw, scene_point_index = TEST_DATASET_WHOLE_SCENE[batch_idx]
                num_blocks = scene_data.shape[0]
                s_batch_num = (num_blocks + BATCH_SIZE - 1) // BATCH_SIZE  # ceil(num_blocks / BATCH_SIZE)
                batch_data = np.zeros((BATCH_SIZE, NUM_POINT, 9))

                batch_label = np.zeros((BATCH_SIZE, NUM_POINT))
                batch_point_index = np.zeros((BATCH_SIZE, NUM_POINT))
                batch_smpw = np.zeros((BATCH_SIZE, NUM_POINT))

                for sbatch in range(s_batch_num):
                    start_idx = sbatch * BATCH_SIZE
                    end_idx = min((sbatch + 1) * BATCH_SIZE, num_blocks)
                    real_batch_size = end_idx - start_idx
                    batch_data[0:real_batch_size, ...] = scene_data[start_idx:end_idx, ...]
                    batch_label[0:real_batch_size, ...] = scene_label[start_idx:end_idx, ...]
                    batch_point_index[0:real_batch_size, ...] = scene_point_index[start_idx:end_idx, ...]
                    batch_smpw[0:real_batch_size, ...] = scene_smpw[start_idx:end_idx, ...]
                    batch_data[:, :, 3:6] /= 1.0

                    torch_data = torch.Tensor(batch_data)
                    torch_data = torch_data.float().cuda()
                    torch_data = torch_data.transpose(2, 1)
                    seg_pred, _ = classifier(torch_data)
                    batch_pred_label = seg_pred.contiguous().cpu().data.max(2)[1].numpy()

                    vote_label_pool = add_vote(vote_label_pool, batch_point_index[0:real_batch_size, ...],
                                               batch_pred_label[0:real_batch_size, ...],
                                               batch_smpw[0:real_batch_size, ...])

            pred_label = np.argmax(vote_label_pool, 1)
            # accumulate the test statistics
            for l in range(NUM_CLASSES):
                total_seen_class_tmp[l] += np.sum((whole_scene_label == l))
                total_correct_class_tmp[l] += np.sum((pred_label == l) & (whole_scene_label == l))
                total_iou_deno_class_tmp[l] += np.sum(((pred_label == l) | (whole_scene_label == l)))
                total_seen_class[l] += total_seen_class_tmp[l]
                total_correct_class[l] += total_correct_class_tmp[l]
                total_iou_deno_class[l] += total_iou_deno_class_tmp[l]

            iou_map = np.array(total_correct_class_tmp) / (np.array(total_iou_deno_class_tmp, dtype=np.float) + 1e-6)
            print(iou_map)
            arr = np.array(total_seen_class_tmp)
            tmp_iou = np.mean(iou_map[arr != 0])
            log_string('Mean IoU of %s: %.4f' % (scene_id[batch_idx], tmp_iou))
            print('----------------------------')
            # write the results to file
            filename = os.path.join(visual_dir, scene_id[batch_idx] + '.txt')
            with open(filename, 'w') as pl_save:
                for i in pred_label:
                    pl_save.write(str(int(i)) + '\n')
                pl_save.close()
            for i in range(whole_scene_label.shape[0]):
                color = g_label2color[pred_label[i]]
                color_gt = g_label2color[whole_scene_label[i]]
                if args.visual:
                    fout.write('v %f %f %f %d %d %d\n' % (  # prefixing each line with 'v ' marks a vertex in the obj format
                        whole_scene_data[i, 0], whole_scene_data[i, 1], whole_scene_data[i, 2], color[0], color[1],
                        color[2]))
                    fout_gt.write(
                        'v %f %f %f %d %d %d\n' % (
                            whole_scene_data[i, 0], whole_scene_data[i, 1], whole_scene_data[i, 2], color_gt[0],
                            color_gt[1], color_gt[2]))
            if args.visual:
                fout.close()
                fout_gt.close()

        IoU = np.array(total_correct_class) / (np.array(total_iou_deno_class, dtype=np.float) + 1e-6)
        iou_per_class_str = '------- IoU --------\n'
        for l in range(NUM_CLASSES):
            iou_per_class_str += 'class %s, IoU: %.3f \n' % (
                seg_label_to_cat[l] + ' ' * (14 - len(seg_label_to_cat[l])),
                total_correct_class[l] / float(total_iou_deno_class[l]))
        log_string(iou_per_class_str)
        log_string('eval point avg class IoU: %f' % np.mean(IoU))
        log_string('eval whole scene point avg class acc: %f' % (
            np.mean(np.array(total_correct_class) / (np.array(total_seen_class, dtype=np.float) + 1e-6))))
        log_string('eval whole scene point accuracy: %f' % (
                np.sum(total_correct_class) / float(np.sum(total_seen_class) + 1e-6)))

        print("Done!")

Continuing training from the saved model

We can keep training from the currently shipped model. Note that the provided best_model was already trained to epoch 110, so the epoch value in train_semseg.py has to be set higher than 110; here it is set to 140 (i.e. about 30 more epochs).

def parse_args():
    parser = argparse.ArgumentParser('Model')
    parser.add_argument('--model', type=str, default='pointnet_sem_seg', help='model name [default: pointnet_sem_seg]')
    parser.add_argument('--batch_size', type=int, default=16, help='Batch Size during training [default: 16]')
    parser.add_argument('--epoch', default=140, type=int, help='Epoch to run [default: 32]')
    parser.add_argument('--learning_rate', default=0.001, type=float, help='Initial learning rate [default: 0.001]')
    parser.add_argument('--gpu', type=str, default='0', help='GPU to use [default: GPU 0]')
    parser.add_argument('--optimizer', type=str, default='Adam', help='Adam or SGD [default: Adam]')
    parser.add_argument('--log_dir', type=str, default='pointnet_sem_seg' , help='Log path [default: None]')
    parser.add_argument('--decay_rate', type=float, default=1e-4, help='weight decay [default: 1e-4]')
    parser.add_argument('--npoint', type=int, default=4096, help='Point Number [default: 4096]')
    parser.add_argument('--step_size', type=int, default=10, help='Decay step for lr decay [default: every 10 epochs]')
    parser.add_argument('--lr_decay', type=float, default=0.7, help='Decay rate for lr decay [default: 0.7]')
    parser.add_argument('--test_area', type=int, default=5, help='Which area to use for test, option: 1-6 [default: 5]')

    return parser.parse_args()
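
Resuming works because train_semseg.py first tries to load the existing best_model.pth from the log directory before training starts; a simplified sketch of that resume logic (assuming the checkpoint stores an 'epoch' value alongside 'model_state_dict', which is the key the test script also reads):

try:
    checkpoint = torch.load(str(experiment_dir) + '/checkpoints/best_model.pth')
    start_epoch = checkpoint['epoch']                        # e.g. 110 for the shipped model
    classifier.load_state_dict(checkpoint['model_state_dict'])
    log_string('Use pretrained model')
except Exception:
    log_string('No existing model, starting training from scratch...')
    start_epoch = 0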

Since the experiment runs under Windows 10, the num_workers parameter of the dataloaders has to be set to 0, otherwise an error is raised:

trainDataLoader = torch.utils.data.DataLoader(TRAIN_DATASET, batch_size=BATCH_SIZE, shuffle=True, num_workers=0,
                                                  pin_memory=True, drop_last=True,
                                                  worker_init_fn=lambda x: np.random.seed(x + int(time.time())))
testDataLoader = torch.utils.data.DataLoader(TEST_DATASET, batch_size=BATCH_SIZE, shuffle=False, num_workers=0,
                                                 pin_memory=True, drop_last=True)

Set the running parameters in the IDE:

--model pointnet_sem_seg --test_area 5 --log_dir pointnet_sem_seg

On my underpowered graphics card the training is still time-consuming, roughly one epoch per hour, so I stopped it at epoch 122 and ran the test again after training:

eval point avg class IoU: 0.437896

eval whole scene point avg class acc: 0.531335

eval whole scene point accuracy: 0.784970

Supplement: Python code to construct obj from txt 

ff = open('./1.obj','w')  # first create an empty obj file
with open('./conferenceRoom_1.txt','r') as f:  # open the txt file of any room in the dataset
    line = f.readlines()  # read all lines
    for line_list in line:
        line_new = 'v '+line_list
        ff.write(line_new)
ff.close()  # close the output obj file

In effect this just prepends "v " (v plus a space) to each line; with the extension changed to .obj, the file can then be opened directly in meshlab.

Origin blog.csdn.net/weixin_42371376/article/details/118142529