Analysis of small target detection technology


The small target detection and tracking system is divided into four modules:

· Hardware module

  The module is based on the standard PCI bus and is equipped with large-scale programmable devices (a DSP and an FPGA), giving it strong computing and processing capability.

· DSP program module

  This module detects and tracks small targets against complex backgrounds. To meet the system's real-time requirements, infrared energy accumulation is used for detection: first the infrared image is enhanced, then candidate small targets are coarsely detected, and finally the small target is extracted from the video sequence. In parallel, prediction is used to estimate the target's likely position and search region, enabling real-time, accurate tracking (or memory tracking) of the target. The system operates through four states, search, capture, tracking, and memory tracking, switching among them to achieve real-time detection and tracking of small targets.

· Driver module

  Its main function is data communication and information interaction between the hardware module and the upper-layer application. The system uses single-cycle reads/writes in PCI 9054 Target mode; to meet the real-time requirement of transmitting and processing 25 frames of image data per second, PCI 9054 Scatter/Gather DMA transfers are used. Information interaction across the whole system uses a one-time handshake, i.e. a request-response protocol.

· Upper application module

  The main function of this module is to download the DSP tracking program to the hardware module, start/stop the DSP, display the scene video in real time, store moving-target sequences in real time, and analyze and display their basic characteristics in real time.
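The one-time (request-response) handshake described for the driver module can be sketched in Python. This is a minimal illustrative model only; the class and method names are hypothetical and do not reflect the actual PCI 9054 driver interface.

```python
# Sketch of a one-time handshake (request-response) exchange.
# All names here are illustrative, not the real PCI 9054 API.

class HandshakeChannel:
    """Single request/response exchange: the requester posts a request,
    the responder consumes it and posts exactly one reply."""

    def __init__(self):
        self.request = None   # payload posted by the host side
        self.response = None  # payload posted by the board side

    def send_request(self, payload):
        if self.request is not None:
            raise RuntimeError("previous request not yet serviced")
        self.request = payload

    def service(self, handler):
        # Responder consumes the pending request and answers it once.
        if self.request is None:
            return False
        self.response = handler(self.request)
        self.request = None
        return True

    def read_response(self):
        resp, self.response = self.response, None
        return resp

chan = HandshakeChannel()
chan.send_request({"cmd": "start_tracking"})
chan.service(lambda req: {"ack": req["cmd"]})
print(chan.read_response())  # → {'ack': 'start_tracking'}
```

The point of the one-time handshake is that every request is answered by exactly one response before the next request may be issued, which keeps host-board interaction deterministic.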

Object detection means accurately locating the objects in a given image and labeling each object's category; the problem it solves is, in full, where the object is and what it is. In real photos, however, object sizes vary greatly; angle, posture, and position differ; and objects may overlap one another, all of which makes detection very difficult.

Target detection has made great progress in recent years, mainly because convolutional neural networks applied to detection tasks have replaced the original feature-extraction methods based on hand-crafted rules.

  • Traditional target detection algorithms:
    Cascade + Haar, HOG + SVM, DPM, and many improvements and optimizations of these methods.
    Among traditional detectors, the multi-scale Deformable Part Model (DPM) performs best, winning the VOC (Visual Object Classes) detection challenge from 2007 to 2009. DPM treats an object as multiple parts (such as the nose and mouth of a face) and describes the object through the relationships between its parts, which fits the non-rigid nature of many objects well. DPM can be seen as an extension of HOG + SVM that inherits the advantages of both, and it achieved good results in tasks such as face detection and pedestrian detection; however, DPM is relatively complex and slow to run, which spurred many improved methods.
    Traditional target detection has two main problems: first, the sliding-window region selection strategy is untargeted, has high time complexity, and produces redundant windows; second, hand-designed features are not robust to variations in appearance.
  • Target detection algorithms based on deep learning:

o Candidate-region (region proposal) + deep learning classification algorithms:
These extract candidate regions and classify each region with a deep learning method, e.g. R-CNN (Selective Search + CNN + SVM), SPP-net (ROI Pooling), Fast R-CNN (Selective Search + CNN + ROI), Faster R-CNN (RPN + CNN + ROI), and Mask R-CNN (Mask Prediction Branch + RPN + CNN + ROI).

o Regression algorithms based on deep learning:
YOLO, SSD, YOLOv2, YOLOv3, and similar algorithms.

At present, deep learning methods for target detection fall into two categories: two-stage and one-stage algorithms. The former first generates a set of candidate boxes as samples and then classifies those samples with a convolutional neural network; the latter skips candidate-box generation and casts box localization directly as a regression problem. This difference explains their different performance: two-stage methods are superior in detection and localization accuracy, while one-stage methods are superior in speed.

Small target detection

There are two ways to define a small target. One is relative size: if the target's width and height are about 0.1 of the original image's, it can be regarded as a small target. The other is absolute size: targets smaller than 32 * 32 pixels can be considered small.
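The two definitions above can be expressed as a small helper; the function name and thresholds are just the values quoted in the text, written as defaults.

```python
def is_small_target(box_w, box_h, img_w, img_h,
                    rel_thresh=0.1, abs_thresh=32):
    """Two common definitions of a 'small target':
    relative: width and height are at most rel_thresh of the image;
    absolute: the target fits inside abs_thresh x abs_thresh pixels."""
    relative = box_w <= rel_thresh * img_w and box_h <= rel_thresh * img_h
    absolute = box_w < abs_thresh and box_h < abs_thresh
    return relative or absolute

# A 20x20 object in a 1280x720 frame is small by both definitions;
# a 200x200 object in the same frame is not.
print(is_small_target(20, 20, 1280, 720))    # True
print(is_small_target(200, 200, 1280, 720))  # False
```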

Small target detection has long been a difficulty for deep convolutional neural network models. Most early detection frameworks were aimed at common, ordinary-sized targets, such as the classic single-stage methods YOLO and SSD and the two-stage method Faster R-CNN. Because these methods were designed around common-object datasets, their detection of small targets in images is not ideal.

The methods proposed to solve the small target problem include:

  • Image zooming. The most straightforward direction: zoom the image before detection. However, because enlarged images become too big to fit on the GPU for training, naive upscaling is not effective. Ao et al. [2017] first downsample the image, then use reinforcement learning to train an attention-based model that dynamically searches for regions of interest; the selected regions are then examined at high resolution to predict smaller targets. This avoids giving every pixel equal attention and analysis, saving some computation. Some papers [Dai et al., 2016b, 2017, Singh and Davis, 2018] use image pyramids during training for target detection, and [Ren et al., 2017] use them during testing.
  • Shallow networks. Small objects are more likely to be predicted by detectors with small receptive fields; deeper networks have larger receptive fields and easily lose information about small objects in the coarser layers. Sommer et al. [2017b] proposed a very shallow network with only four convolutional layers and three fully connected layers for detecting targets in aerial images. Such detectors are useful when the expected instances are small; when the expected instances vary in size, exploiting the context surrounding small object instances works better. Gidaris and Komodakis [2015] and Zhu et al. [2015b] use context to improve performance, and Chen et al. [2016a] use context specifically to improve small object performance: they extend R-CNN with context patches processed in parallel with the proposal patches generated by the region proposal network. Zagoruyko et al. [2016] combine their method with the DeepMask object proposals so that information flows through multiple paths.
  • Super resolution. There are also image-enhancement approaches for small targets, most typically using generative adversarial networks to selectively raise the resolution of small targets: the generator learns to enhance the poor representations of small objects into super-resolved ones that resemble real large objects closely enough to fool the competing discriminator.

In the past two years, methods that exploit multi-layer feature maps (feature pyramids, RNN-style gating, layer-by-layer prediction) have been proposed and have significantly improved small target detection.

The mainstream algorithms at this stage are:
Image pyramid: proposed relatively early, it samples the training images at multiple scales. Upsampling can enhance the fine-grained features of small targets and, in theory, improve the localization and recognition of small targets. However, training convolutional neural network models on image pyramids demands very high compute and memory, beyond what computer hardware so far comfortably supports, so this method is rarely used in practice.
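A minimal sketch of such a multi-scale pyramid, using plain nearest-neighbour resampling in numpy (real pipelines would use proper interpolation); the scale set is an illustrative assumption.

```python
import numpy as np

def image_pyramid(img, scales=(0.5, 1.0, 2.0)):
    """Build a multi-scale pyramid by nearest-neighbour resampling.
    Upsampled levels (scale > 1) enlarge small targets at the cost of
    memory, which is why pyramid training is so expensive."""
    levels = []
    h, w = img.shape[:2]
    for s in scales:
        nh = max(1, int(round(h * s)))
        nw = max(1, int(round(w * s)))
        rows = (np.arange(nh) / s).astype(int).clip(0, h - 1)
        cols = (np.arange(nw) / s).astype(int).clip(0, w - 1)
        levels.append(img[np.ix_(rows, cols)])
    return levels

img = np.arange(16, dtype=np.float32).reshape(4, 4)
pyr = image_pyramid(img)
print([lvl.shape for lvl in pyr])  # [(2, 2), (4, 4), (8, 8)]
```

Note the memory cost: the 2x level alone holds four times the pixels of the original, which is the practical obstacle the paragraph above describes.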

Layer-by-layer prediction: this method makes a prediction from each layer's feature map of the convolutional neural network and combines the results at the end. It likewise demands very high hardware performance.

Feature pyramid: references feature information from multi-scale feature maps, combining strong semantic features with precise location features. Its advantage is that multi-scale feature maps are already intrinsic intermediate outputs of a convolutional neural network, so stacking them adds little algorithmic complexity.
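The top-down merge at the heart of a feature pyramid can be sketched in a few lines of numpy; this is a simplified FPN-style sketch (real feature pyramids also apply 1x1 lateral and 3x3 smoothing convolutions, omitted here, and the channel/size numbers below are made up for illustration).

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def feature_pyramid(c3, c4, c5):
    """FPN-style top-down pathway (sketch): the deep, semantically
    strong map is upsampled and added to the shallower, higher-
    resolution map, so every level gets both semantics and location."""
    p5 = c5
    p4 = c4 + upsample2x(p5)
    p3 = c3 + upsample2x(p4)
    return p3, p4, p5

c3 = np.zeros((8, 32, 32))
c4 = np.zeros((8, 16, 16))
c5 = np.ones((8, 8, 8))
p3, p4, p5 = feature_pyramid(c3, c4, c5)
print(p3.shape, p3[0, 0, 0])  # (8, 32, 32) 1.0
```

The semantic signal injected at `c5` (the ones) propagates down to the high-resolution `p3` level, which is exactly why small targets benefit: they are detected on a large map that still carries deep semantics.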

RNN idea: borrows the gating mechanism and long short-term memory of RNN algorithms to record multi-level feature information simultaneously (note: this differs essentially from the feature pyramid). The inherent defect of RNNs is slow training, since some operations cannot be expressed as matrix operations.

Whether a target counts as small depends on either the absolute size (in pixels) or the relative size (relative to the original image's width and height). "Large" versus "small" only matters relative to the receptive field; a CNN itself can detect objects at any scale. SSD is not well suited to small target detection, while R-FCN has problems with speed and robustness.

Small targets come in many kinds, and a simple background makes detection easier. One small-face detector uses a fully convolutional network (FCN) + ResNet; that paper exploits small peripheral cues such as hair and shoulders to detect tiny targets.

First of all, small targets have few pixels and inconspicuous features, so their detection rate is lower than for large targets no matter which algorithm is used. How, then, do algorithms differ on small targets? Single-stage multi-scale algorithms such as SSD and YOLO need high input resolution for small target detection. SSD does not reuse the high-resolution low-level features, yet these layers matter most for detecting small targets, so small-target detection is mainly performed on the lowest feature layer used, conv4_3 in SSD; but the semantic information there is not rich enough, which is a contradiction (if the convolutional stack is deep enough the impact is smaller). I think the most important factor is poor scale settings. The default anchor scales in SSD start at 0.1 ~ 0.2, so the minimum detectable size on a 720p image is 72 pixels, which is still too large. In fact, the SSD source code allows a feature layer to slide windows at multiple scales: initializing each element of the min_sizes parameter to a list generates anchors of different scales on the corresponding feature layer. With careful design this can basically cover small enough targets, but the number of anchors then grows considerably and speed drops.
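The "72 pixels on a 720p image" arithmetic above can be made concrete. This sketch spaces scales linearly across the 0.1 ~ 0.2 range quoted in the text; the six-layer count and linear spacing are illustrative assumptions, not the exact SSD configuration.

```python
def anchor_sizes(image_size, min_scale=0.1, max_scale=0.2, num_layers=6):
    """Anchor side lengths in pixels for scales spaced linearly between
    min_scale and max_scale (range per the text; layer count assumed)."""
    if num_layers == 1:
        scales = [min_scale]
    else:
        step = (max_scale - min_scale) / (num_layers - 1)
        scales = [min_scale + i * step for i in range(num_layers)]
    return [round(s * image_size) for s in scales]

# For a 720-pixel image the smallest default anchor at scale 0.1 is
# 72 pixels, far larger than a 20-pixel small target.
print(anchor_sizes(720))  # [72, 86, 101, 115, 130, 144]
```

Setting min_sizes to a list of much smaller values is exactly the fix the paragraph describes: the smallest anchors then drop below the small-target size, at the cost of many more anchors per feature map.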

Faster R-CNN, YOLO, and SSD do poorly on small targets because the final feature map of the convolutional backbone is too small: a 32 * 32 target becomes 2 * 2 after VGG, so the subsequent detection and regression cannot meet the requirements. The deeper a convolutional network, the stronger its semantic information, while lower layers describe more local appearance; this is why stacking so many convolutional layers in VGG16 is meaningful (if the early layers alone worked well, VGG's researchers would have thought to reduce the depth). Extracting features from multiple layers should give stronger representational power. For example, in a sample cat-and-dog image, the ground truth of the smaller cat is matched only at the bottom layers and not at the higher layers, while the ground truth of the larger dog is matched on the high-level feature maps; if the other layers' information is simply concatenated, the semantic and context information available for small object detection remains weak.

SSD extracts proposals from multi-scale feature maps, so SSD is more stable than YOLO on small targets. YOLO obtains its predictions directly from global features and depends entirely on accumulated data. For small targets, I think reducing pooling is worth considering.

The layer responsible for detecting small targets in SSD is conv4_3 (38 * 38), with a corresponding scale of 0.2 (settable manually), so the smallest scale SSD can detect is about 0.2. If an object is smaller than that, then even during training the ground truth cannot find a matching default box, so the result cannot be good. If you do not mind the overall detection effect, you can lower this scale and see whether small target detection improves. In addition, multi-scale detection can also improve small object detection.
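Why a too-small object "cannot find a matching default box" is just IoU arithmetic. A hypothetical numerical check (box coordinates below are made up): on a 300 * 300 input, a scale-0.2 default box is 60 * 60 pixels, and even a perfectly centred 16 * 16 ground truth scores far below the usual 0.5 matching threshold.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (142, 142, 158, 158)           # 16x16 ground truth, centred
default_box = (120, 120, 180, 180)  # 60x60 anchor (scale 0.2 on 300 input)
print(round(iou(gt, default_box), 3))  # → 0.071, well below 0.5
```

So the small object contributes no positive anchor at all during training, which is the failure mode the paragraph describes; lowering the scale shrinks the default box and raises this IoU.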

SSD uses VGG16 for feature extraction: the conv4_3 feature map is downsampled 8x, and conv5_3 is downsampled 16x. A 32 * 32 object, for example, corresponds to only 2 * 2 cells on VGG16's conv5_3 feature map, with a large loss of location information. There are two ways to address this: 1. use features from different layers, as in HyperNet and MultiPath; 2. keep the receptive field while reducing the feature map's resolution less, for example with the hole (dilated convolution) algorithm used in DeepLab, which changes resolution little while preserving the receptive field.
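The 32 * 32 → 2 * 2 shrinkage is simply integer division by the cumulative stride; a tiny helper makes the numbers above reproducible (the function name is mine, the stride values are the 8x/16x figures from the text).

```python
def feature_footprint(obj_size, stride):
    """Approximate side length, in cells, that an object occupies on a
    feature map with cumulative stride `stride` (floored, at least 1)."""
    return max(1, obj_size // stride)

# VGG16 in SSD: conv4_3 has stride 8, conv5_3 has stride 16.
print(feature_footprint(32, 8))   # 4 cells per side on conv4_3
print(feature_footprint(32, 16))  # 2 cells per side on conv5_3
```

A 2 * 2 footprint leaves almost nothing for the regression head to localize, which is why both remedies above aim to keep the effective stride small for small objects.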

The main reason these detectors do poorly on small targets is that SSD/YOLO scale the original image down; because of the receptive field, targets of small relative size become hard to detect. The R-CNN series does not scale the original image, but for targets of small absolute size there is still no remedy, because by a certain depth of the feature map the small target may have lost its response. 1. Small targets rely more on shallow features, which have higher resolution but weaker semantic discrimination. 2. SSD performs detection and classification jointly, and results that are detected but classified ambiguously with low scores get filtered out; an RPN does not do this: the top 200 candidates go on to classification, so there is always some result. 3. For speed, SSD is nominally a fully convolutional network yet fixes the input size, which greatly hurts small targets in large images.

Some better views

The resolution of CNN features is poor, no better than some low-level (shallow) features; the evidence is that in pedestrian detection, some hand-crafted features still hold up well. There are also problems intrinsic to Faster R-CNN and SSD themselves: the original Faster R-CNN fixes the shortest input side in the RPN to 600, which is larger than SSD512, which is larger than SSD300. SSD uses 300 and 512 inputs to improve detection speed, which is why SSD is so fast; to preserve accuracy, SSD adds multi-scale features and data augmentation (note this augmentation in particular).

YOLO and SSD are genuinely weak on small objects, and some conventional image-processing algorithms do much better at detecting small objects, though they are not very robust. You can try R-FCN: in my tests it seems acceptable on small objects but slower, about 0.18 s per image on a GTX 970. I previously ran R-FCN experiments with VGG16, using the same ResNet-101 proposals (I only cared about detection quality, hence the same proposals); the result was worse than Fast R-CNN, and likewise worse than GoogLeNet-v1 (which is also fully convolutional). I suspect this is overfitting of the shallower network, because using VGG's own proposals gave very poor results.

SSD is a detector built on a fully convolutional network that uses different layers to detect objects of different sizes. There is a contradiction here: the early feature maps are large but semantically weak, while the later ones are semantically strong but, after too much pooling, too small. Detecting small objects needs a feature map large enough to provide fine features and dense sampling, yet semantically strong enough to distinguish object from background. I asked the SSD author at a conference whether enlarging the final feature map and connecting it to the earlier ones would improve performance; he said it was worth a try.

SSD is a class-aware RPN with a lot of bells and whistles. Each pixel on a feature map corresponds to several anchors, and the network trains the anchors to drive feature training; that is the preface. A small object's corresponding anchors are few (anchors whose overlap with the ground truth exceeds 0.5), which means the pixels on the corresponding feature map are hard to train fully. A large ROI may cover many anchors, so all of those anchors get the chance to be trained; a small object cannot cover many. What is the problem with under-trained anchors? At test time, that pixel's predictions may be noisy and greatly interfere with the normal results. The reason SSD's data augmentation helps so much is that random crops let every anchor be fully trained (a cropped-out small object becomes a large object in the new image). Without region proposals, the result is naturally weak on small objects, and only by piling up hacks can it slowly catch up.

I tried replacing the first convolutions of SSD with a deep residual network, and its small object detection is decent, much better than YOLO. Also, in the original SSD paper the base sizes of the multiple levels run from 0.1 to 0.8 of the input; a ratio of 0.1 is actually still large. For a 1024 * 1024 input, 0.1 is 102 pixels, which is not small. You can adjust the levels to your needs; I use 1/64 to 1, i.e. the grid sizes of the different levels. Of course, once the levels change from linear to exponential, the shape variations above each base size also need adjusting (mainly enlarging), otherwise objects of certain sizes falling between two grids may not be covered. YOLO's pitfall is that its penultimate layer is fully connected, so the trailing 7 * 7 grid cannot be expanded much or the preceding fully connected layer blows up; and with a coarse grid, the same cell may contain several small objects, making detection difficult. The role of YOLO's fully connected layer is to integrate global information; removing it requires enlarging the receptive field of the last point, for which deep residual networks are well suited. Speaking of deep residuals: in R-FCN: Object Detection via Region-based Fully Convolutional Networks, by deep-residual author Kaiming He, spatial pooling is used, and the deep residual network turns out to be a natural structure with a spatial pooling effect. One more point: the CNN in front of SSD should be fully symmetric, especially when pooling, so that the center of the receptive field of each output point coincides with the center of each grid cell; if it is asymmetric, the features extracted by the front CNN become tangled, and many small objects become hard to learn.
But YOLO ends with a fully connected layer, so for it this is fine and the problem is small. As a last remark, how should we understand that the deep residual network is itself a spatial pooling structure? Its smallest unit has two branches, one with convolutions and one directly connected; if the convolution branch does little more than translate the image, the final sum is a spatially shifted combination, i.e. a pooling-like spatial mixing.



Origin www.cnblogs.com/wujianming-110117/p/12729934.html