An Empirical Study of Remote Sensing Pretraining (2)

4. Fine-tuning on Downstream Tasks

Three RSP pre-trained models are used: ResNet-50-E300 (ResNet-50 pre-trained for 300 epochs), Swin-T-E300 (Swin-T pre-trained for 300 epochs), and ViTAEv2-S-E100 (ViTAEv2-S pre-trained for 100 epochs).

These three pre-trained models are used to fine-tune the downstream tasks, which include scene recognition, semantic segmentation, object detection, and change detection. The datasets used for these tasks are ones commonly used for aerial scenes, rather than MillionAID, which was already used for pre-training.
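As a concrete illustration, fine-tuning typically starts by loading the RSP backbone weights and swapping the pre-training classification head for one sized to the downstream dataset. Below is a minimal sketch assuming a PyTorch ResNet-50 backbone; the checkpoint filename and the "model"/"fc." key names are assumptions for illustration, not the paper's released file format.

```python
import torch
import torchvision

# Build a randomly initialized ResNet-50; the RSP weights will overwrite it.
backbone = torchvision.models.resnet50(weights=None)

# Hypothetical checkpoint filename; substitute the file actually released for RSP.
ckpt = torch.load("rsp-resnet-50-e300-ckpt.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under a "model" key

# Drop the MillionAID classification head; each downstream task attaches its own.
state = {k: v for k, v in state.items() if not k.startswith("fc.")}
missing, unexpected = backbone.load_state_dict(state, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")

# Re-size the head for the downstream dataset, e.g. the 21 UCM classes.
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 21)
```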

4.1 Aerial Scene Recognition

4.1.1 Datasets: ① UCM ② AID ③ NWPU-RESISC


1. UCM: This is one of the most widely used datasets for scene recognition. It contains 2,100 images, all of size 256 × 256 with a pixel resolution of 0.3 m, evenly distributed across 21 categories, i.e., 100 images per category. All samples were manually extracted from large images of various urban areas across the United States in the USGS National Map Urban Area Imagery collection.
2. AID: This is a challenging dataset built by collecting images from multi-source sensors on Google Earth (GE). It has high intra-class diversity because the images are carefully selected from different countries and were extracted at different times and seasons under different imaging conditions. It contains 10,000 images of size 600 × 600 belonging to 30 categories.
3. NWPU-RESISC: This dataset is characterized by its large number of samples. It contains 31,500 images in 45 categories, with 700 samples per category. Each image is 256 × 256 pixels, and the spatial resolution varies from 0.2 m to 30 m. Some special terrains, such as islands, lakes, ordinary mountains, and snow mountains, may have lower resolutions.

4.1.2 Implementation details and experimental setup

The training-to-testing split ratios are set per dataset: UCM (8:2), AID (2:8) and (5:5), and NWPU-RESISC (1:9) and (2:8); see the sketch below.
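A minimal sketch of building such fixed-ratio splits, assuming the scene datasets are stored one folder per class (torchvision's ImageFolder layout); the root paths are hypothetical, and the paper's exact split protocol may differ (e.g. per-class stratification).

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

def make_split(root: str, train_ratio: float, seed: int = 0):
    """Split an ImageFolder dataset into train/test subsets at a given ratio."""
    ds = datasets.ImageFolder(root, transform=transforms.ToTensor())
    n_train = int(len(ds) * train_ratio)
    gen = torch.Generator().manual_seed(seed)  # fixed seed -> reproducible split
    return random_split(ds, [n_train, len(ds) - n_train], generator=gen)

train_ucm, test_ucm = make_split("data/UCM", 0.8)    # UCM 8:2
train_aid, test_aid = make_split("data/AID", 0.2)    # AID 2:8
train_nwpu, test_nwpu = make_split("data/NWPU", 0.1) # NWPU-RESISC 1:9
```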

4.1.3 Experimental results

Results of the selected models and SOTA methods under different settings on the three scene recognition datasets: bold marks the best result within each of the last three groups, and "*" marks the best result among all models.

4.2 Aerial Semantic Segmentation

Aerial semantic segmentation is also a classification task similar to aerial scene recognition, but at the pixel level instead of the scene level. We evaluate the above three models on the aerial semantic segmentation task, which includes the scene parsing and object segmentation subtasks: the former labels every pixel of the whole scene, while the latter segments only the foreground objects.

4.2.1 Datasets

1. ISPRS Potsdam: This dataset is published by the ISPRS Working Group II/4. It covers a large scene of 3.42 square kilometers in the city of Potsdam and contains 38 images with an average size of 6000 × 6000 pixels and a resolution of 0.5 m, of which 24 are used for training and 14 for testing. The scenes contain 6 categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter.
2. iSAID: This is a large-scale dataset mainly used for instance segmentation. It also provides semantic masks for 15 foreground categories of aerial objects and 1 background category. It consists of 2,806 high-resolution images ranging from 800 × 800 to 4,000 × 13,000 pixels; the training, validation, and test sets have 1,411, 458, and 937 images, respectively. In this paper, only the validation set is used for evaluation since the test set is not available.

4.2.2 Implementation Details

The ISPRS Potsdam and iSAID images are cropped into patches of 512 × 512 and 896 × 896 pixels with strides of 384 and 512, respectively, and random horizontal flipping is used for data augmentation, as sketched below.
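A minimal sketch of this sliding-window cropping, assuming the common convention that windows near the right/bottom borders are shifted back so every patch stays inside the image; the exact cropping tool used in the paper may differ.

```python
import numpy as np

def crop_patches(img: np.ndarray, size: int, stride: int):
    """Cut square patches of `size` from `img` with step `stride`."""
    h, w = img.shape[:2]
    ys = list(range(0, max(h - size, 0) + 1, stride)) or [0]
    xs = list(range(0, max(w - size, 0) + 1, stride)) or [0]
    # Make sure the last window touches the image border.
    if ys[-1] + size < h:
        ys.append(h - size)
    if xs[-1] + size < w:
        xs.append(w - size)
    return [img[y:y + size, x:x + size] for y in ys for x in xs]

# Potsdam: 512x512 patches with stride 384; iSAID: 896x896 with stride 512.
patches = crop_patches(np.zeros((6000, 6000, 3), dtype=np.uint8), 512, 384)
print(len(patches))
```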

4.2.3 Experimental results

Results of segmentation models with different backbones and SOTA methods on the test set of the ISPRS Potsdam dataset

 

Results of UperNet segmentation models with different backbones and SOTA segmentation methods on the validation set of the iSAID dataset

4.3 Aerial Object Detection

Since aerial images are taken from overhead, objects can appear in any orientation in the bird's-eye view. Aerial object detection is thus oriented bounding box (OBB) detection, as distinguished from the horizontal bounding box (HBB) detection usually performed on natural images. As with segmentation, different detection datasets are used in the experiments; specifically, the subtasks of multi-class remote sensing object detection and single-class ship detection are evaluated separately.
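To make the OBB/HBB distinction concrete: while an HBB is just (x1, y1, x2, y2), an OBB carries a rotation angle, commonly parameterized as (cx, cy, w, h, θ) and expanded to four corner points. A minimal sketch of that expansion follows (the five-parameter encoding is one common convention, not necessarily the exact one used by every detector in the paper).

```python
import math

def obb_to_corners(cx, cy, w, h, theta):
    """Return the 4 corner points of a w-by-h box centered at (cx, cy),
    rotated by theta radians (counter-clockwise)."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Apply the 2D rotation matrix [[c, -s], [s, c]] to each corner offset.
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in half]

print(obb_to_corners(100, 100, 40, 20, math.pi / 6))
```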

4.3.1 Datasets: DOTA and HRSC2016

1. DOTA: This is the most famous large-scale dataset for OBB detection. It contains 2,806 images ranging in size from 800 × 800 to 4,000 × 4,000 pixels, with 188,282 instances in 15 categories. The training, validation, and test sets contain 1,411, 458, and 937 images, respectively.
2. HRSC2016: This is a specialized ship detection dataset whose bounding boxes are annotated in arbitrary orientations. It contains 1,061 images ranging in size from 300 × 300 to 1,500 × 900 pixels. In the official split, 436/181/444 images are used for training, validation, and testing, respectively. The dataset has only one class, since there is no need to identify the ship type.

4.3.2 Implementation details and experimental setup

The DOTA images are cropped into 1,024 × 1,024 patches with a stride of 824, while the HRSC2016 images are resized keeping the aspect ratio, with the shorter side scaled to 800 pixels and the longer side capped at 1,333 pixels (see the sketch below). Data augmentation during training includes random horizontal and vertical flipping. For convenience, the original training and validation sets are merged for training, while the original test sets of DOTA and HRSC2016 are used for evaluation. We report the mean average precision (mAP) over all classes and the average precision (AP) for each class on the corresponding test set.
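A minimal sketch of the aspect-ratio-preserving resize rule for HRSC2016, assuming the usual "shorter side 800, longer side capped at 1333" convention stated above.

```python
def resize_scale(h: int, w: int, short: int = 800, long_cap: int = 1333) -> float:
    """Scale so the shorter side becomes `short`, unless that would push
    the longer side past `long_cap`, in which case the cap wins."""
    scale = short / min(h, w)
    if max(h, w) * scale > long_cap:  # longer side would overshoot the cap
        scale = long_cap / max(h, w)
    return scale

# e.g. a 300x1500 image: the shorter-side scale 800/300 would make the longer
# side 4000 > 1333, so the cap wins and the scale is 1333/1500.
print(resize_scale(300, 1500))
```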

4.3.3 Experimental results

Results of Oriented R-CNN (ORCN) detection models with different backbones and SOTA methods on the test set of the DOTA dataset

 

Results of Oriented R-CNN detection models with different backbones and SOTA methods on the test set of the HRSC2016 dataset

5. Conclusion

In this study, we investigate the remote sensing pre-training (RSP) problem on MillionAID, the largest remote sensing scene dataset, using both CNN and vision transformer models, and comprehensively evaluate their performance on four related tasks: scene recognition, semantic segmentation, object detection, and change detection, comparing them with ImageNet pre-training (IMP) and other SOTA methods. Through a comprehensive analysis of the experimental results, the following conclusions are drawn:
1. Compared with traditional CNN models, vision transformers are competitive on a range of remote sensing tasks, and on some more challenging datasets, such as iSAID and DOTA, they achieve better performance. In particular, ViTAEv2-S, an advanced model that introduces CNN inductive biases into the vision transformer, achieves the best performance on almost all settings of these tasks.
2. Benefiting from the large capacity of the ImageNet-1K dataset, classic IMP enables deep models to learn general representations that transfer well to almost all categories in the downstream tasks. IMP therefore produces competitive baselines even on aerial scenes. RSP is comparable to IMP and performs especially well on some specific categories, such as "bridge" and "aircraft", because it mitigates the data-level gap between the upstream pre-training task and the downstream tasks.
3. Task-level differences can also negatively affect RSP's performance. When the representation granularity required by a downstream task is closer to that of the upstream pre-training task, namely scene recognition, RSP usually delivers better performance.

We hope this study can shed useful light on the use of advanced vision transformers and remote sensing pre-training. In future work, we will study RSP on larger-scale datasets for downstream tasks, as well as unsupervised pre-training, considering the large amount of unlabeled data in this field.


Origin blog.csdn.net/weixin_42715977/article/details/130842753