Thesis Research | Development of Machine Vision in the Field of Unmanned Aerial Vehicles


0 Preface

[Omitted.]

1 Visual object detection based on DCNN

[Omitted.]

2 DCNN-Based Research on Visual Detection of Small UAVs

2.1 UAV target detection dataset

DCNN-based object detection algorithms generally rely on large-scale datasets for model training and performance evaluation. However, publicly available large-scale drone detection datasets are still scarce. The existing datasets released by international anti-drone challenges and the self-built datasets described in the published literature are introduced below.

2.1.1 Anti-UAV2020 Dataset

The Anti-UAV2020 [44] dataset contains 160 high-quality dual-modal (visible light + near-infrared) video sequences, of which 100 videos are used for training and validation and 60 for testing. The dataset covers commercial drones across various scenes, scales, and models (including the DJI Inspire, DJI Phantom 4, DJI Mavic Air, and DJI Mavic Pro). Example images from the dataset are shown in Figure 3. The visible-light and near-infrared videos were collected by visible-light and infrared photoelectric sensors fixed on the ground, respectively. Ground-truth annotations were produced by professional annotators and include the position and size of each bounding box, target attributes (large/medium/small target, day, night, cloud, building, false target, sudden speed change, hovering, occlusion, scale change), and a flag indicating whether a target is present in the current frame. For the second Anti-UAV challenge, Anti-UAV2021 [45], the dataset was expanded to 280 high-definition infrared video sequences covering fast-moving UAV targets in a variety of complex scenes, making the detection task even more challenging.


2.1.2 Drone-vs-Bird Detection Challenge Dataset

The Drone-vs-Bird Detection Challenge [46] dataset contains 11 MPEG4 videos captured at different times, each paired with an annotation file in XML format. As shown in Figure 4, the drones in these scenes vary in scale, viewpoint, and brightness. In particular, the dataset contains a large number of distant, small drones as well as birds: many drone instances occupy fewer than 20 pixels, and more than 300 bounding boxes have side lengths of only 3 to 4 pixels, making the detection of these tiny objects very challenging.

2.1.3 Self-built datasets that are not open source

In addition to the public datasets above, many researchers have built their own datasets to train their networks and describe them in their published papers.

The Anti-Drone Dataset established in [47] contains 449 videos. The drone models captured include the Mavic Pro, Phantom 2, and Phantom, among others. Video frames have resolutions of 2048×1536 or 1024×768 at a frame rate of 24 FPS. As shown in Figure 5, the frames cover different camera angles, magnifications, weather conditions, and day or night, reflecting the complexity of the UAV detection task.

The UAV dataset of [48] collects images of 20 UAV models: 15 rotary-wing UAVs, 3 fixed-wing UAVs, and 2 unmanned helicopters. The dataset also emphasizes the complexity and diversity of the background: as shown in Figure 6, the drones appear against 30 different locations such as residential buildings, commercial centers, mountains, forests, rivers, factories, and coasts, better reflecting the variety of scenes a UAV detection system may encounter in actual deployment. It contains 200,000 images at 1920×1080 resolution, split into 140,000 training and 60,000 test images, each with its corresponding ground-truth annotation.


2.2 UAV detection for static images

Focusing on UAV detection and early-warning tasks, researchers have developed a considerable number of UAV detectors on top of mainstream object detection algorithms. The main problems these algorithms address include multi-scale UAV detection based on general object detection algorithms, few-sample UAV detection, and UAV detection in infrared images.

2.2.1 UAV target detection based on general target detection algorithm

Like general detectors, UAV detection algorithms can be roughly divided into two-stage and single-stage algorithms according to whether they explicitly generate candidate regions, and each type has its advantages: on the same dataset, without any further optimization, the two-stage Faster R-CNN achieves higher detection accuracy, while the single-stage YOLO family runs faster. The static-image UAV detection algorithms proposed in the computer vision field are introduced below.
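As a point of reference for the methods below, here is a minimal sketch of running an off-the-shelf two-stage detector using torchvision's pre-trained Faster R-CNN. The file name and score threshold are illustrative, and a COCO-pretrained model has no drone class, so this only demonstrates the inference API; a real anti-UAV system would first be fine-tuned as discussed in Section 2.2.2.

```python
# Minimal sketch: single-frame inference with a pre-trained two-stage detector.
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("frame.jpg")            # hypothetical file; uint8 CxHxW tensor
batch = [weights.transforms()(img)]      # convert/normalize for the model

with torch.no_grad():
    (pred,) = model(batch)               # one dict per input image

keep = pred["scores"] > 0.5              # illustrative confidence threshold
print(pred["boxes"][keep], pred["labels"][keep], pred["scores"][keep])
```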

Addressing the small apparent size of long-range drones in the imaging field of view, Vasileios et al. [49] proposed a drone detection algorithm that adds a deep super-resolution model to Faster R-CNN training. As shown in Figure 7, the super-resolution model [50] uses a deep residual network to extract features and reconstruct the image, raising the resolution of small UAV targets in the input and thereby improving the recall of the Faster R-CNN-based detector. Celine Craye et al. [51] split drone detection into two steps: the spatio-temporal sequence of video frames is first fed into a U-Net [52] model to obtain drone candidate regions, which a ResNet101 model then classifies. This pipeline resembles the two-stage R-CNN algorithm and improves the detection of small drone targets. However, two-stage detection methods based on Faster R-CNN have certain limitations in real-time computation.

Given the computational efficiency of the YOLO family, [53] developed a UAV detector based on YOLOv2. However, YOLOv2 divides the image into a grid, and each grid cell can predict at most a single target, so when multiple targets fall into the same cell some are missed. In addition, the features learned by a conventional deep convolutional network are not robust to orientation and scale changes, so the detector is less effective on small and overlapping objects.


Building on the Darknet53 backbone of YOLOv3, [54] uses Gabor filters to modulate the convolution kernels of the DCNN, enhancing the robustness of the learned features to orientation and scale changes. Verified on a dataset, the method outperforms approaches that combine multi-scale scale-invariant feature transform (SIFT) features with classification models such as vectors of locally aggregated descriptors, bag-of-words, and Fisher vectors. However, the algorithm was not compared with DCNN-based detectors such as YOLOv3, so the advantage of the Gabor-modulated DCNN over them has not been verified.

Because UAV targets vary greatly in scale within the imaging field of view, YOLOv3's three-level detection struggles to cover the full range of UAV scale variation. Addressing this problem, [55] added multi-scale feature fusion to the YOLOv3 model to detect drones with significant scale changes. [48] likewise proposed the UAVDet model for UAV detection based on YOLOv3 (shown in Figure 8), expanding YOLOv3 to predict at 4 scales and adding two residual units after the second downsampling module to retain more location information. It should be pointed out that, because single-stage algorithms do not explicitly generate candidate boxes, the YOLO family uses the k-means [56] clustering algorithm to generate prior boxes from the dataset. When applying YOLO-style detectors to UAV detection, k-means clustering therefore also needs to be run on the specific UAV dataset to generate prior boxes better suited to UAVs, as sketched below. Meanwhile, to combat motion blur in the images, Gaussian-blur and motion-blur augmentations were applied to the dataset, effectively improving detection precision and recall.
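A minimal sketch of this anchor-generation step, assuming ground-truth (width, height) pairs have been extracted from a UAV dataset's annotations. The box sizes below are hypothetical, and the 1 − IoU distance follows common YOLO practice.

```python
# Minimal sketch: YOLO-style prior boxes via k-means with a 1 - IoU distance.
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming boxes share a top-left corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=6, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs; nearest cluster = highest IoU."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)].copy()
    for _ in range(iters):
        assign = iou_wh(wh, anchors).argmax(axis=1)
        for j in range(k):
            members = wh[assign == j]
            if len(members):
                anchors[j] = np.median(members, axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]   # sort by area

# Hypothetical (w, h) values, in pixels, from a UAV dataset's annotations.
wh = np.array([[8, 6], [14, 9], [23, 15], [40, 22], [65, 38], [120, 70]] * 50,
              dtype=float)
print(kmeans_anchors(wh, k=6))
```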


2.2.2 Application of transfer learning and data augmentation in UAV detection

As mentioned above, DCNN-based object detection is usually a data-driven, supervised learning approach that relies on large-scale datasets for training and evaluation. However, public large-scale UAV detection datasets are still scarce, and training DCNN models on small datasets is prone to overfitting, so researchers use transfer learning and data augmentation to alleviate this contradiction.

Transfer learning is a commonly used technique in machine learning. It usually refers to reusing a model pre-trained on one task in another task, transferring the knowledge the model learned on one dataset to another and thereby improving its generalization. For drone detection specifically, a model can first be fully trained on other large-scale datasets (such as general object detection), and the pre-trained network can then be fine-tuned on a relatively small drone detection dataset. Muhamma et al. [57] fine-tuned a model pre-trained on ImageNet on the Drone-vs-Bird Detection Challenge dataset, enabling it to better detect drones. Using the Faster R-CNN algorithm, the authors compared ZFNet, VGG16, and VGG_CNN_1024 as feature extraction networks; the results show that VGG16 achieves relatively better performance on this dataset. In the 2019 Drone-vs-Bird Detection Challenge, the competition data introduced more complex target backgrounds, richer lighting conditions, and more variable image scales, including many low-contrast images and scenes with a variety of birds. Nalamati et al. [58] adopted a similar transfer learning route and compared the Faster R-CNN and SSD algorithms; their experiments showed that Faster R-CNN with a ResNet101 backbone was more accurate but limited in real-time performance.
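A minimal fine-tuning sketch in the spirit of this transfer learning route, using torchvision's pre-trained Faster R-CNN and replacing its classification head for two classes (background and drone). The single synthetic training step is a stand-in for iterating over a real UAV dataset.

```python
# Minimal sketch: fine-tune a pre-trained detector for drone detection.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pre-trained on a large general dataset (COCO).
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)

# Replace the classification head: 2 classes = background + drone.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes=2)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=5e-4)

# One synthetic training step; a real run iterates over the UAV dataset.
model.train()
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 140.0, 150.0]]),
            "labels": torch.tensor([1])}]              # 1 = drone
loss = sum(model(images, targets).values())            # sum of detector losses
optimizer.zero_grad()
loss.backward()
optimizer.step()
```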

Data augmentation is another common means of alleviating overfitting in model training: the sample size is increased by transforming existing data or synthesizing new data from it. Common augmentation methods include geometric transformation, flipping, color modification, cropping, rotation, noise injection, random occlusion, transparency blending, and crop-and-paste compositing, all of which can be introduced into UAV detection to alleviate the small-sample problem. For example, given the difficulty of obtaining large-scale UAV detection data, [59] pasted image patches of birds and UAVs onto different background images, ultimately obtaining 676,534 images with which a better UAV detection model can be trained.
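A minimal sketch of this paste-onto-background style of augmentation. The file names are hypothetical, and the random scale jitter is an illustrative addition rather than part of the procedure in [59].

```python
# Minimal sketch: synthesize a training sample by pasting a drone crop
# onto a background image, returning the new ground-truth box.
import random
from PIL import Image

def paste_object(background: Image.Image, crop: Image.Image):
    """Paste a drone/bird crop at a random position; return image and box."""
    bg = background.copy()
    scale = random.uniform(0.5, 1.5)                       # mild size jitter
    w = max(1, int(crop.width * scale))
    h = max(1, int(crop.height * scale))
    obj = crop.resize((w, h))
    x = random.randint(0, bg.width - w)
    y = random.randint(0, bg.height - h)
    # An RGBA crop serves as its own transparency mask.
    bg.paste(obj, (x, y), obj if obj.mode == "RGBA" else None)
    return bg, (x, y, x + w, y + h)

background = Image.open("scene.jpg").convert("RGB")        # hypothetical files
drone_crop = Image.open("drone_crop.png").convert("RGBA")
aug_img, box = paste_object(background, drone_crop)
aug_img.save("synth_sample.jpg")
print("new ground-truth box:", box)
```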

2.2.3 Infrared Image UAV Detection

Visible-light images have high resolution and usually carry good texture and shape information, which benefits DCNN feature learning and representation and thus UAV detection. However, under poor illumination such as fog or night, visible-light imagery has low visibility and the UAV target is hard to capture. Infrared imaging sensors, in contrast, offer long detection range, all-weather operation, and strong adaptability to lighting conditions, but suffer from low resolution, poor contrast, low signal-to-noise ratio, and a lack of texture and shape information, making UAV detection in infrared images more challenging. [60] preprocessed infrared images by inversion, histogram equalization, denoising, and sharpening, and then introduced an SPP module and the GIoU (Generalized Intersection over Union) loss into the YOLOv3 model, improving the model's ability to detect close-range targets, large targets, and edge targets. [61] segments infrared images with a fully convolutional network, enhances small targets with a visual saliency mechanism, and suppresses background and false alarms; its detection results surpass typical infrared target detection algorithms. [62] exploits the complementarity of infrared and visible-light images through multi-scale salient feature fusion, detects with an improved YOLOv3 model, and uses an attention mechanism to fuse feature information from the auxiliary and backbone networks, strengthening informative channels and suppressing uninformative ones to improve small-target detection.
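A minimal sketch, under assumed parameters, of the preprocessing chain of [60] (inversion, histogram equalization, denoising, sharpening) together with the GIoU measure on which its loss is based. The OpenCV routines and file name are illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch: infrared preprocessing and the GIoU box-overlap measure.
import cv2
import numpy as np

def preprocess_ir(gray: np.ndarray) -> np.ndarray:
    inv = 255 - gray                                   # inversion
    eq = cv2.equalizeHist(inv)                         # histogram equalization
    den = cv2.fastNlMeansDenoising(eq, h=10)           # denoising
    blur = cv2.GaussianBlur(den, (0, 0), sigmaX=3)
    return cv2.addWeighted(den, 1.5, blur, -0.5, 0)    # unsharp-mask sharpening

def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2): IoU - |C \\ U| / |C|."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    # Smallest enclosing box C.
    c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c - union) / c

ir = cv2.imread("ir_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
out = preprocess_ir(ir)
print(giou((10, 10, 50, 50), (30, 30, 80, 80)))
```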

When the UAV target in an infrared image is very small (for example, smaller than 9×9 pixels), it must be treated as an infrared small target. Typical manual-feature methods for infrared small target detection include difference-of-Gaussians filtering, local contrast measures [63], two-dimensional least-mean-square filtering [64], morphological top-hat transforms [65-66], and nonlinear image patch-based models [67]. To overcome the limited adaptability of hand-crafted features, some scholars have recently introduced DCNNs into infrared small target detection. [68] recast small target detection as classifying the positional distribution of small targets: a fully convolutional network performs background suppression and target enhancement while extracting candidate target regions, and the original image and candidate regions are then fed into a classification network that outputs the detection results. Training and testing on 50,000 images show that the method can effectively detect small objects in complex backgrounds with low signal-to-noise ratio and even motion blur. However, it still suffers from a high false alarm rate, because static appearance features alone often cannot distinguish real small targets from point-like background clutter. Therefore, in complex backgrounds and at low signal-to-noise ratios, effectively exploiting spatio-temporal context information for infrared small-object detection remains a challenging task [69].
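Among the manual-feature methods above, the morphological top-hat transform [65-66] is simple to illustrate. The sketch below, with illustrative structuring-element sizes and thresholds, keeps bright structures smaller than the structuring element and thresholds the residual to obtain candidate small targets.

```python
# Minimal sketch: white top-hat filtering for infrared small-target candidates.
import cv2
import numpy as np

ir = cv2.imread("ir_frame.png", cv2.IMREAD_GRAYSCALE)      # hypothetical file

# Structuring element slightly larger than the expected target (< 9x9 px).
se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))
tophat = cv2.morphologyEx(ir, cv2.MORPH_TOPHAT, se)        # bright residual

# Adaptive threshold: mean + k * std of the residual image.
k = 5.0
thr = tophat.mean() + k * tophat.std()
_, mask = cv2.threshold(tophat, thr, 255, cv2.THRESH_BINARY)

# Connected components give candidate target centroids.
n, _, stats, centroids = cv2.connectedComponentsWithStats(mask.astype(np.uint8))
for i in range(1, n):                                      # label 0 = background
    if stats[i, cv2.CC_STAT_AREA] <= 81:                   # keep only tiny blobs
        print("candidate target at", centroids[i])
```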

2.3 UAV detection for video data

UAV detection from video data is the core anti-UAV detection task. On the one hand, the data produced by photoelectric sensors for UAV detection is usually video (i.e., image sequences); on the other hand, when the target cannot be identified in a single static frame, the contextual spatio-temporal information in the video must be used for target enhancement, detection, and recognition. However, video-based UAV detection faces several difficulties: first, consecutive frames in a video sequence carry much redundant information; second, backgrounds with complex motion patterns strongly interfere with target detection; third, violent drone motion or a defocused sensor lens blurs the target's appearance. Video-based UAV detection therefore needs to combine static appearance information with target-specific motion information (i.e., contextual information in the spatial and temporal domains) for discrimination. As noted above, a considerable number of static-image object detection methods have been proposed in computer vision, but research on object detection in video, especially UAV detection, is comparatively scarce. Existing work mainly relies on optical flow and temporal features to represent motion information and thereby better accomplish object detection in video data.

2.3.1 Video object detection based on optical flow field 

Video moving object detection is the process of detecting moving objects in a continuous image sequence. The main methods include frame differencing (two-frame or multi-frame), background suppression, and optical flow, of which optical flow is the most effective. Optical flow usually refers to the instantaneous velocity (both speed and direction) of moving objects in space as projected onto the imaging plane. If there is no moving target, the optical flow varies continuously across the whole image; if there is a moving target, the flow field it produces differs from that of the background, which allows the moving target to be distinguished from the background. An effective method for computing the optical flow field was first proposed by Horn and Schunck [70] in 1981; it assumes that the gray value of an object is instantaneously constant and that the flow varies smoothly over the whole image. Lucas and Kanade [71] proposed an improved optical flow algorithm that assumes the motion vector is constant over a small spatial neighborhood and estimates the flow by weighted least squares.
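A minimal sketch of this flow-based separation of a moving target from the background, using OpenCV's Farneback dense flow as a stand-in for the original Horn-Schunck and Lucas-Kanade formulations. The frame files, parameters, and threshold are illustrative.

```python
# Minimal sketch: dense optical flow and a motion mask from the flow residual.
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)   # hypothetical files
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# A moving target's flow differs from the (roughly uniform) background flow;
# subtracting the median magnitude crudely compensates for global motion.
residual = np.abs(mag - np.median(mag))
mask = (residual > 2.0).astype(np.uint8) * 255             # illustrative threshold
cv2.imwrite("motion_mask.png", mask)
```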

However, the above methods compute optical flow iteratively and are usually computationally expensive. More importantly, their assumption of constant brightness across consecutive frames is too strict, limiting the accuracy of flow estimation under complex lighting conditions. In 2015, Fischer et al. recast optical flow computation as a supervised learning problem and proposed the deep-learning-based FlowNet [72] method. As shown in Figure 9, FlowNet takes two consecutive frames (supporting RGB images) as input; the network consists of a convolutional downsampling part and a deconvolutional upsampling part. The downsampling network extracts features hierarchically and encodes high-level semantic information, while the deconvolutional network decodes this semantic information together with the hierarchical features to predict the optical flow. Trained on large amounts of data, FlowNet significantly improves flow-estimation performance. The follow-up FlowNet 2.0 [73] and RAFT [74] models further improved DCNN-based optical flow computation.


Given that the optical flow field has many excellent properties for representing object motion, introducing optical flow into video moving object detection can be expected to significantly improve detection performance. One idea is to use optical flow to eliminate redundant information between consecutive frames. For example, [75] observed that the DCNN feature maps of adjacent frames are usually very similar, so running the DCNN on every frame wastes a large amount of computation; instead, only keyframes selected at a fixed interval are processed, and the features of non-keyframes are obtained by propagating keyframe features along the optical flow (a sketch of such flow-guided warping follows below). Since computing optical flow is much faster than DCNN feature extraction, this greatly reduces the computation of video processing and speeds up video object detection. However, this approach mainly suits cases where moving objects and the background change continuously between adjacent frames. Another way to use optical flow for video moving object detection is to superimpose flow information on static appearance information, further increasing the difference between object and background. [76] uses a DCNN to obtain appearance feature maps of the current and reference frames and FlowNet to predict their optical flow fields, then superimposes the appearance features and the corresponding flow into spatio-temporal hybrid features, from which object detection results are obtained for the current and reference frames. This method effectively exploits the spatio-temporal information in video and helps with moving-object blur, significantly improving detection performance; however, it places certain requirements on target strength and local SNR, and is mainly suited to offline video detection, needing improvement for real-time online detection. With UAV videos and their annotations, these optical-flow-based detection models can be effectively transferred to UAV detection tasks.
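A minimal sketch of the feature-propagation idea of [75], assuming a keyframe feature map and a pixel-level flow field are already available: the keyframe features are warped to a non-keyframe by bilinear sampling.

```python
# Minimal sketch: warp keyframe DCNN features along optical flow.
import torch
import torch.nn.functional as F

def warp_features(feat_key: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat_key: (N, C, H, W) keyframe features; flow: (N, 2, H, W) in pixels,
    mapping non-keyframe positions back to keyframe positions."""
    n, _, h, w = feat_key.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().expand(n, -1, -1, -1)
    src = base + flow                                   # sampling locations
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * src[:, 0] / (w - 1) - 1.0
    gy = 2.0 * src[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                # (N, H, W, 2)
    return F.grid_sample(feat_key, grid, mode="bilinear", align_corners=True)

feat = torch.randn(1, 256, 48, 64)        # stand-in for a keyframe feature map
flow = torch.zeros(1, 2, 48, 64)          # zero flow -> identity warp
assert torch.allclose(warp_features(feat, flow), feat, atol=1e-5)
```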

2.3.2 UAV detection based on multi-frame correlation features

The optical flow method can effectively represent target motion when video quality is high, but tends to fail when the target is blurred or extremely faint. To address this, Rozantsev et al. [77] accumulate target energy over multiple consecutive frames along the temporal dimension to enhance the target. As shown in Figure 10, sliding windows of different scales first extract spatio-temporal image cubes from the image sequence; motion compensation is then applied to each cube to obtain a motion-stabilized cube, which greatly concentrates candidate target energy and raises the local signal-to-noise ratio of potential targets; finally, a classifier judges whether each stabilized cube contains a target, and non-maximum suppression refines the detections (a simplified alignment-and-average sketch follows below). Compared with optical-flow-based methods, this approach is markedly more robust to complex background interference and object motion blur.
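A deliberately simplified illustration of this energy-accumulation idea: instead of the per-cube motion compensation of [77], a global ECC affine alignment to a reference frame is applied before temporal averaging, so a consistently present faint target accumulates energy while uncorrelated noise averages out. Frame files and the iteration criteria are illustrative.

```python
# Simplified sketch: align frames to a reference, then average to boost SNR.
import cv2
import numpy as np

frames = [cv2.imread(f"frame_{i:03d}.png", cv2.IMREAD_GRAYSCALE)  # hypothetical
          for i in range(5)]
mid = len(frames) // 2
ref = frames[mid].astype(np.float32)

criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
stack = [ref]
for i, f in enumerate(frames):
    if i == mid:
        continue
    warp = np.eye(2, 3, dtype=np.float32)            # affine warp matrix
    cv2.findTransformECC(ref, f.astype(np.float32), warp,
                         cv2.MOTION_AFFINE, criteria)
    aligned = cv2.warpAffine(f.astype(np.float32), warp, ref.shape[::-1],
                             flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    stack.append(aligned)

# Temporal averaging: the aligned target accumulates, noise averages out.
accum = np.mean(stack, axis=0)
cv2.imwrite("accumulated.png", np.clip(accum, 0, 255).astype(np.uint8))
```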

Because the temporal dimension is lost during conventional convolutional network training, the spatio-temporal consistency of features cannot be guaranteed. Besides the motion-compensation approach above, some researchers feed image sequences into neural networks that extract latent motion features; the main architectures used are Siamese networks [78] and recurrent neural networks (RNN) [79]. [80] proposed a detection framework based on a fully convolutional network that uses a Siamese network to extract temporal information. As a sequence model, an RNN can likewise provide temporal information: in a recurrent network, the output of the current layer depends not only on the current input but also on inputs at previous time steps, giving the network a "memory" function. RNNs are mainly used in the field of natural language processing.

In practice, video-based UAV detection often encounters dynamic non-target distractors such as swaying tree branches and flying birds, which are hard to distinguish from real targets using inter-frame optical flow alone. Addressing this problem, [81] observed that, as an artificially designed aircraft, a UAV's flight dynamics follow certain specific laws, and proposed a detection method based on multi-frame target shape-change characteristics and trajectory regularities, which reduces the false alarm rate to some extent. However, its target segmentation is based on background differencing, so it places high demands on the complexity of the background motion and the magnitude of sensor motion (translation, rotation, and jitter).

[The remainder is omitted.]

Interested readers can download the full paper from CNKI.


Article source: Yang Xin, Wang Gang, Li Liang, Li Shaogang, Gao Jin, Wang Yizheng. Research progress on detection of small civilian drones based on deep convolutional neural networks [J]. Infrared Technology, 2022, 44(11).


At present, many domestic AI companies are working on combining machine vision with drone applications, and it is a very popular sector, but high cost and unsatisfactory accuracy remain common outcomes. Here I would like to recommend a domestic machine vision platform, Coovally, which covers the complete AI modeling process, AI project management, and AI system deployment management. It can shorten the development cycle from months to days and accelerate the development, integration, testing, and verification of AI vision solutions. It helps enterprises strengthen their AI software stack so that advanced AI systems can be adopted at lower cost and faster, and it packages AI capabilities for business staff to use directly, "teaching people how to fish". Coovally currently covers multiple application areas, including manufacturing quality inspection, geological disaster monitoring, power-industry equipment monitoring, diagnosis of special diseases in medicine, smart transportation, and smart parks.
