Self introduction & project introduction

Self introduction:

  My name is LSH, and my hometown is JN City, SD Province. I studied mechanical engineering at DB University as an undergraduate. During my undergraduate studies I took part in many national and school-level competitions, and I also participated in an undergraduate innovation training ("Dachuang") project on tilt-rotor drones, where I was mainly responsible for the structural design of the drone. My academic performance stayed in the top five percent over the four years, and in my senior year I was named a provincial outstanding graduate and was recommended for exam-free admission to a master's program at Xi'an Jiaotong University.
  Now I am a second-year master's student in mechanical engineering at XAJT University. My main research direction is robot grasping, which mainly involves 3D point cloud segmentation, pose estimation, and path planning. I also completed two corporate internships during my master's studies. The internship at Xi'an Zhixiang Optoelectronics was also a robot-grasping project. The internship at Hikvision Research Institute was relatively short; I worked there as an algorithm engineer and was mainly responsible for testing and optimizing algorithm models. That is a short self-introduction.

  My name is Li Shenghao, and my hometown is Jining City, Shandong Province. I studied at Northeastern University as an undergraduate. During that time I took part in national and university-level competitions many times and achieved good results, and my academic record stayed in the top five percent. In my senior year I was named an outstanding graduate of Liaoning Province and was recommended to Xi'an Jiaotong University to study for a master's degree.
  The main research topic of my master's studies is robot grasping, which mainly involves three parts: point cloud segmentation, pose estimation, and path planning. I also did two corporate internships during my master's studies, and the work there was also related to computer vision.

Robot grasping project

  I completed this project independently. It addresses robot grasping of stacked objects; the target objects are relatively regular rectangular (box-shaped) items. The technical route has three parts: 3D point cloud segmentation, pose estimation, and path planning.
  For 3D point cloud segmentation, we first built a small-sample dataset, labeling the points belonging to the object to be grasped as 1 and all other points as 0. The segmentation network is based on DGCNN, with some minor modifications. In the project, the point cloud captured by the binocular (stereo) camera is fed into the network, which outputs the points belonging to the object to be grasped for subsequent processing.
  For pose estimation, an OBB (oriented bounding box) is fitted to the point cloud output by the segmentation network; the normal vectors (axes) of this bounding box give the orientation of the object to be grasped, and averaging the point coordinates gives the object's center. Together these yield the 6-DoF pose of the object to be grasped.
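  A minimal sketch of this pose-estimation step, assuming Open3D is used for the OBB fitting (the project's actual implementation may differ):

```python
import numpy as np
import open3d as o3d

def estimate_pose(points: np.ndarray):
    """Estimate a 6-DoF pose (center + orientation) from the segmented object points.

    points: (N, 3) array of xyz coordinates output by the segmentation network.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)

    # Fit an oriented bounding box (OBB); its rotation matrix R gives the
    # principal axes, i.e. the orientation of the box-shaped object.
    obb = pcd.get_oriented_bounding_box()
    rotation = np.asarray(obb.R)          # 3x3 rotation matrix

    # The grasp center is taken as the mean of the object points.
    center = points.mean(axis=0)

    return center, rotation
```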
  For path planning, an RRT*-based algorithm is used. The gripper position of the manipulator is set as the start point and the position of the object to be grasped as the goal, and an (asymptotically) optimal path is computed. For collision checking between the manipulator and obstacles, each link of the manipulator is modeled as a line segment, and the algorithm checks whether that segment interferes with the obstacle, as sketched below.
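  A minimal sketch of this link-obstacle interference check, assuming each obstacle is approximated by a sphere (center + radius); the real project may use a different obstacle model:

```python
import numpy as np

def segment_point_distance(a, b, p):
    """Shortest distance from point p to the segment a-b (all (3,) arrays)."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    closest = a + t * ab
    return np.linalg.norm(p - closest)

def link_collides(link_start, link_end, obstacles):
    """Treat a manipulator link as a segment and each obstacle as a sphere.

    obstacles: list of (center, radius) tuples.
    """
    for center, radius in obstacles:
        if segment_point_distance(link_start, link_end, np.asarray(center)) <= radius:
            return True
    return False
```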

Describe your point cloud segmentation network in detail:

  I performed the point cloud segmentation task based on the DGCNN network. The xyz coordinates of the object point cloud captured by the binocular camera on the experimental platform are used as the network input. The input first passes through a spatial transform module that rectifies the point cloud, then through three EdgeConv (edge convolution) layers. A global feature of the point cloud is extracted by max pooling and concatenated with the local features computed by the preceding edge convolutions, so each point carries both local and global information. The network then outputs a score for each point, from which the segmentation result is obtained.
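  A minimal PyTorch sketch of a single EdgeConv block as used in DGCNN (layer sizes and k are illustrative, not the project's actual settings):

```python
import torch
import torch.nn as nn

def knn(x, k):
    """x: (B, C, N) point features; returns (B, N, k) indices of the k nearest neighbors."""
    inner = -2 * torch.matmul(x.transpose(2, 1), x)            # (B, N, N)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)                # (B, 1, N)
    dist = -xx - inner - xx.transpose(2, 1)                    # negative squared distance
    return dist.topk(k=k, dim=-1)[1]

def get_graph_feature(x, k):
    """Build edge features [x_i, x_j - x_i] for each point and its k neighbors."""
    B, C, N = x.shape
    idx = knn(x, k)                                            # (B, N, k)
    idx_base = torch.arange(B, device=x.device).view(-1, 1, 1) * N
    idx = (idx + idx_base).view(-1)
    x_t = x.transpose(2, 1).contiguous()                       # (B, N, C)
    neighbors = x_t.view(B * N, C)[idx].view(B, N, k, C)
    center = x_t.unsqueeze(2).expand(-1, -1, k, -1)
    return torch.cat([center, neighbors - center], dim=3).permute(0, 3, 1, 2)  # (B, 2C, N, k)

class EdgeConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=20):
        super().__init__()
        self.k = k
        self.conv = nn.Sequential(
            nn.Conv2d(2 * in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):                                      # x: (B, C, N)
        feat = get_graph_feature(x, self.k)                    # (B, 2C, N, k)
        feat = self.conv(feat)
        return feat.max(dim=-1)[0]                             # max over neighbors -> (B, out_ch, N)
```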

Why use the DGCNN network:

  PointNet was the pioneering work in applying deep learning directly to point cloud processing, and PointNet++ later addressed PointNet's inability to extract local features. DGCNN is also an improved version of PointNet. Since I built the dataset myself, I tested all three networks on it, and DGCNN gave the best results, so I chose DGCNN for point cloud segmentation for the time being.

Why point cloud segmentation does not use traditional methods:

  The biggest difference between deep learning and traditional methods is that deep-learning features are learned automatically from large amounts of data, while the features used in traditional methods are mainly hand-designed. Hand design relies on the designer's prior knowledge and makes it hard to benefit from big data; because parameters must be tuned manually, the number of parameters a hand-designed feature can contain is very limited. Deep learning, in contrast, automatically learns feature representations from data, and these representations can contain thousands of parameters.
  In terms of features and classifiers: in traditional methods the feature extractor and the classifier are optimized separately, whereas under a neural-network framework the feature representation and the classifier are optimized jointly, which maximizes the performance of their collaboration.

What are the traditional methods of point cloud segmentation:

  Segmentation and classification of point clouds are much more complicated than processing two-dimensional images. Point cloud segmentation includes region extraction, line and surface extraction, semantic segmentation, clustering, and so on; it touches many areas, and in general point cloud segmentation is the basis of target recognition.
  Segmentation: region growing, RANSAC line/plane extraction, NDT-RANSAC, K-Means, Normalized Cut, 3D Hough transform (line/plane extraction), connectivity analysis.
  Classification: point-based classification, segmentation-based classification, supervised classification and unsupervised classification.
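  As an example of a traditional approach, here is a minimal sketch of RANSAC plane extraction, assuming a recent Open3D version (the file path is illustrative):

```python
import open3d as o3d

# Load a point cloud (path is illustrative).
pcd = o3d.io.read_point_cloud("scene.pcd")

# RANSAC plane fitting: returns the plane coefficients (a, b, c, d) of
# ax + by + cz + d = 0 and the indices of the inlier points.
plane_model, inliers = pcd.segment_plane(distance_threshold=0.01,
                                         ransac_n=3,
                                         num_iterations=1000)

plane = pcd.select_by_index(inliers)              # points on the extracted plane
rest = pcd.select_by_index(inliers, invert=True)  # remaining points
```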

What are the advantages of DGCNN compared to other point cloud segmentation networks?

  PointNet lacks consideration of local features. PointNet++ builds a graph based on the Euclidean distance between point pairs and uses farthest point sampling to select the points fed to the next layer, so the graph shrinks layer by layer, but its structure (defined in the fixed coordinate space) does not change.
  DGCNN's graph is dynamic because the k nearest neighbors are taken in feature space; since the features computed by each layer are different, the graph at each layer effectively has different neighborhoods (edges).

What is the loss function in the DGCNN network you use?

  The loss function used in DGCNN is the cross-entropy loss.
  The cross-entropy loss is widely used in classification tasks in deep learning; it measures the gap between the predicted distribution and the ground truth. Cross entropy is defined as:
  H(p, q) = -Σ_x p(x) log q(x)
  Here p is the ground-truth probability distribution and q is the predicted distribution. Cross entropy is derived from relative entropy (KL divergence); -log q(x) represents the amount of information: the larger q(x) is, the more likely the event and the less information it carries, and conversely the smaller q(x) is, the more information it carries. Training gradually reduces the cross-entropy loss, which reduces the distance between p and q.
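  A tiny numerical sketch of this cross entropy for the binary (background / object) labels used in the segmentation task above; the probability values are made up for illustration:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log(q(x)); p is the ground truth, q the prediction."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

# Ground truth: this point belongs to the object (class 1 of {background, object}).
p = np.array([0.0, 1.0])

print(cross_entropy(p, np.array([0.1, 0.9])))  # confident and correct -> small loss (~0.105)
print(cross_entropy(p, np.array([0.7, 0.3])))  # wrong-leaning prediction -> larger loss (~1.204)
```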

You mentioned that you used the attention mechanism. Why add the attention mechanism?

  The notable advantage of the attention mechanism is that it focuses on relevant information and ignores irrelevant information. It establishes dependencies between input and output directly, without recurrence, so parallelism is improved and running speed increases greatly.
  It also overcomes some limitations of traditional neural networks, such as performance degrading as the input length grows, low computational efficiency from strictly sequential processing, and weak feature extraction and enhancement. The attention mechanism can model variable-length sequence data well, strengthens the ability to capture long-range dependencies, and improves accuracy while reducing network depth.
  Moreover, the attention module is relatively simple and can easily be embedded in various networks.

What is the difference between the spatial attention mechanism and the channel attention mechanism?

Channel attention mechanism:
  The channel attention mechanism uses the network to compute an importance score (weight) for each channel, i.e. it pays more attention to channels that contain key information and less attention to channels without important information, so as to improve the feature representation ability. Each channel is given a different weight.
Spatial attention mechanism:
  The spatial attention mechanism builds on the channel attention mechanism: compressing along the channel direction, it finds which spatial positions gather the most information. For a given layer's feature map, each spatial position is assigned a different weight.
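  A minimal PyTorch sketch of the two mechanisms in the style of CBAM (the exact module used in the project may differ):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weights each channel using global average- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))      # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))       # (B, C)
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # re-weight channels

class SpatialAttention(nn.Module):
    """Weights each spatial position using channel-wise average and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)       # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)        # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                            # re-weight spatial positions
```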

How does the path planning part work?

  For path planning I use an RRT*-based planner, and then a third-order (cubic) Bezier curve to smooth the resulting path.
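  A minimal sketch of cubic Bezier smoothing between path waypoints (the control points here are illustrative):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, num=50):
    """Sample a third-order Bezier curve defined by control points p0..p3."""
    t = np.linspace(0.0, 1.0, num)[:, None]
    return ((1 - t) ** 3 * p0 +
            3 * (1 - t) ** 2 * t * p1 +
            3 * (1 - t) * t ** 2 * p2 +
            t ** 3 * p3)

# Example: smooth a sharp corner in an RRT* path (points are made up).
p0, p3 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
p1, p2 = np.array([0.6, 0.0]), np.array([1.0, 0.4])   # control points shape the curve
smooth_segment = cubic_bezier(p0, p1, p2, p3)
```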

A brief introduction to the RRT* principle:

  RRT* is an improvement on RRT. RRT (Rapidly-exploring Random Tree) randomly grows a tree: its main idea is to rapidly expand a set of tree-like paths to explore (fill) part of the space and find a feasible path.
The basic steps of RRT are:
  1. Initialize the space; define the start point, the goal point, the number of sampling points, the step size, and other parameters.
  2. Randomly sample a point Xrand in the space.
  3. Find the point Xnear in the existing tree that is closest to Xrand.
  4. Take a point Xnew on the line from Xnear toward Xrand, at a distance of one step size from Xnear.
  5. Check whether there is an obstacle between Xnear and Xnew; if so, discard Xnew.
  6. Otherwise add Xnew (with Xnear as its parent) to the tree.
  7. Repeat steps 2-6 until a new point falls within the set neighborhood of the goal.

RRT* improves on RRT. Compared with RRT, it adds two steps (see the sketch below):
  1. Re-selecting the parent node for Xnew (choosing, among nearby nodes, the parent that gives the lowest path cost).
  2. Rewiring the random tree (re-connecting nearby nodes through Xnew if that lowers their cost).
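  A minimal sketch of RRT* with the two extra steps (choose-parent and rewire); the obstacle check is a placeholder and all constants are illustrative:

```python
import numpy as np

STEP, RADIUS, GOAL_TOL = 0.5, 1.0, 0.5

def collision_free(a, b):
    """Placeholder obstacle check between two points; replace with a real one."""
    return True

def rrt_star(start, goal, bounds, iters=2000):
    goal = np.asarray(goal, float)
    nodes = [np.asarray(start, float)]
    parent, cost = {0: None}, {0: 0.0}
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)

    for _ in range(iters):
        x_rand = np.random.uniform(lo, hi)                        # step 2: sample
        near_i = int(np.argmin([np.linalg.norm(n - x_rand) for n in nodes]))  # step 3
        direction = x_rand - nodes[near_i]
        x_new = nodes[near_i] + STEP * direction / (np.linalg.norm(direction) + 1e-12)  # step 4
        if not collision_free(nodes[near_i], x_new):              # step 5
            continue

        # RRT* extra step 1: choose the best parent among nearby nodes.
        neighbors = [i for i, n in enumerate(nodes) if np.linalg.norm(n - x_new) < RADIUS]
        best_p, best_c = near_i, cost[near_i] + np.linalg.norm(nodes[near_i] - x_new)
        for i in neighbors:
            c = cost[i] + np.linalg.norm(nodes[i] - x_new)
            if c < best_c and collision_free(nodes[i], x_new):
                best_p, best_c = i, c

        new_i = len(nodes)
        nodes.append(x_new)                                       # step 6: add to the tree
        parent[new_i], cost[new_i] = best_p, best_c

        # RRT* extra step 2: rewire nearby nodes through x_new if that is cheaper.
        for i in neighbors:
            c = cost[new_i] + np.linalg.norm(nodes[i] - x_new)
            if c < cost[i] and collision_free(x_new, nodes[i]):
                parent[i], cost[i] = new_i, c

        if np.linalg.norm(x_new - goal) < GOAL_TOL:               # step 7: goal reached
            path, i = [], new_i
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```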

Briefly introduce your project at Hikvision

During the internship at Hikvision, I mainly worked on a multispectral pedestrian detection project. It uses a single-stage detection framework with multi-label learning to learn input-state-aware features, assigning a separate label to each given state of an input image pair. An RGB image and a thermal image are taken as the network inputs. The backbone is an SSD-like network composed of two independent branches: the thermal image and the RGB image are fed into the two branches, each passing through four convolutional layers, and the remaining convolutional layers are shared until the end. In the multimodal fusion module, the features of the two modalities are concatenated. That is how the network is designed.
Loss function: BCE (binary cross-entropy) loss
Baseline: SSD, using a VGG16 backbone with batch normalization pre-trained on ImageNet
Datasets:
  KAIST: a multispectral pedestrian dataset of 95,328 fully overlapping RGB-thermal image pairs captured in urban environments.
  CVC-14: a multispectral pedestrian dataset whose RGB-thermal pairs are not fully overlapping.
Evaluation metric: MR (miss rate), MR = 1 - Recall, where Recall = TP / GT; TP is the number of positives predicted correctly and GT is the number of ground-truth positives.
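  A rough PyTorch sketch of the two-branch, SSD-like structure described above (layer widths, the fusion point, and the single-scale head are illustrative, not Hikvision's actual model):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class TwoBranchDetector(nn.Module):
    """RGB and thermal images pass through separate branches, then shared layers."""
    def __init__(self):
        super().__init__()
        # Independent branches (four conv layers each, per the description above).
        self.rgb_branch = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                        conv_block(64, 128), conv_block(128, 128))
        self.thermal_branch = nn.Sequential(conv_block(1, 32), conv_block(32, 64),
                                            conv_block(64, 128), conv_block(128, 128))
        # Shared layers after fusion; the two modalities are fused by concatenation.
        self.shared = nn.Sequential(conv_block(256, 256), conv_block(256, 256))
        # Illustrative single-scale prediction head; real SSD predicts at multiple scales.
        self.head = nn.Conv2d(256, 6, kernel_size=3, padding=1)

    def forward(self, rgb, thermal):
        fused = torch.cat([self.rgb_branch(rgb), self.thermal_branch(thermal)], dim=1)
        return self.head(self.shared(fused))
```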

The difference between yolo and ssd:

SSD achieves detection speed comparable to YOLO while maintaining accuracy: YOLO (v1) predicts boxes from a single grid over the final feature map, whereas SSD predicts with default (anchor) boxes on feature maps of multiple scales, which makes it better at handling objects of different sizes.

Sparse Convolution (Sparse Convolution Net)

Introduction:
Sparse convolution is often used in 3D tasks (such as 3D point cloud segmentation). Since point cloud data is sparse (irregular), standard dense convolution cannot be applied efficiently. Similarly, in 2D tasks where only part of the pixels need to be processed, sparse convolution also helps to speed up the model.

Principle:
In essence, sparse convolution stores the calculation results of specific positions through a hash table. The principle is illustrated below with an example.

Input data:
Consider a 5x5 image with 3 channels. All pixels are (0, 0, 0) except for two points P1 and P2. Following [1], such non-zero elements (i.e. P1 and P2) are called active input sites. The input tensor has shape [1, 3, 5, 5] in NCHW order. In sparse form, the data list for [P1, P2] is [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]] and the index list is [[1, 2], [2, 3]].
Convolution kernel:
The convolution kernel of sparse convolution is the same as in traditional convolution. Take a 3x3 kernel as an example; in the original figure, dark and light colors represent two different convolution kernels.

Output definition:
The output of sparse convolution is defined quite differently from traditional convolution. Sparse convolution has two output definitions. One is the regular output definition: just as in ordinary convolution, an output site is computed as long as the kernel covers any active input site. The other is the submanifold output definition: an output is computed only when the kernel center covers an active input site.
In this example (5x5 input image, 3x3 kernel, stride = 1, padding = 0), the output tensor is 3x3. With the regular output definition, the result at position (0, 0) is A1, meaning it depends only on P1, while the result at position (0, 1) is A1A2, meaning it depends on both P1 and P2. With the submanifold output definition, only A1 and A2 respond. In the original figure, different colors represent different output channels.
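  A small Python sketch of how the two output definitions determine which output sites are computed, for the 5x5 / 3x3 example above (a toy illustration, not a real sparse-convolution library):

```python
# Active input sites from the example above: P1 at (row, col) = (1, 2), P2 at (2, 3).
active_inputs = [(1, 2), (2, 3)]
H, W, K, STRIDE, PAD = 5, 5, 3, 1, 0
OUT_H = OUT_W = (H + 2 * PAD - K) // STRIDE + 1     # 3x3 output

regular, submanifold = set(), set()
for (r, c) in active_inputs:
    for out_r in range(OUT_H):
        for out_c in range(OUT_W):
            # Input window covered by this output position.
            rows = range(out_r * STRIDE - PAD, out_r * STRIDE - PAD + K)
            cols = range(out_c * STRIDE - PAD, out_c * STRIDE - PAD + K)
            if r in rows and c in cols:
                regular.add((out_r, out_c))          # kernel covers an active site
            center = (out_r * STRIDE - PAD + K // 2, out_c * STRIDE - PAD + K // 2)
            if center == (r, c):
                submanifold.add((out_r, out_c))      # kernel *center* covers an active site

print(sorted(regular))      # many positions respond (regular output definition)
print(sorted(submanifold))  # only the positions centered on P1 / P2 respond
```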

Origin blog.csdn.net/toCVer/article/details/125573914