Deep-Learning-Based Object Detection Frameworks

1. A Mobile Outdoor Augmented Reality Method Combining Deep Learning Object Detection and Spatial Relationships for Geovisualization

geovisualization: geoscience/geographic visualization

1.1 Deep-Learning-Based Object Detection

Figure 1. An overview of R-CNN. This method first takes an image as an input, then extracts approximately 2000 region proposals and computes features from each proposed region using a deep CNN, and finally uses linear SVMs to classify those proposed regions.
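A minimal sketch of this inference flow in PyTorch, with random boxes standing in for the selective-search proposals and an AlexNet-style network standing in for the fine-tuned CNN; the SVM weights are placeholder inputs, so this is only an illustration of the pipeline, not the original implementation:

```python
import torch
from torchvision.models import alexnet
from torchvision.transforms.functional import resize

# Stand-in for the deep CNN feature extractor (R-CNN fine-tunes an
# AlexNet-style network and uses the 4096-d fc7 activations as features).
cnn = alexnet(weights=None)
cnn.classifier = cnn.classifier[:-1]   # drop the final 1000-way layer, keep fc7 output
cnn.eval()

def propose_regions(image, n=2000):
    """Stand-in for selective search: n random boxes (x1, y1, x2, y2).
    The real R-CNN extracts roughly 2000 selective-search proposals per image."""
    _, h, w = image.shape
    xs = torch.randint(0, w, (n, 2)).sort(dim=1).values
    ys = torch.randint(0, h, (n, 2)).sort(dim=1).values
    return torch.stack([xs[:, 0], ys[:, 0], xs[:, 1] + 1, ys[:, 1] + 1], dim=1)

def rcnn_detect(image, svm_weights, svm_bias, n=2000):
    """image: float tensor (3, H, W); svm_weights: (num_classes, 4096) linear SVMs."""
    boxes = propose_regions(image, n)
    feats = []
    with torch.no_grad():
        for x1, y1, x2, y2 in boxes.tolist():
            crop = image[:, y1:y2, x1:x2]              # crop the proposed region
            warped = resize(crop, [227, 227])          # warp to the CNN's fixed input size
            feats.append(cnn(warped.unsqueeze(0)).squeeze(0))
        scores = torch.stack(feats) @ svm_weights.t() + svm_bias   # per-class SVM scores
    return boxes, scores

# Small n just to keep the demo fast; running one CNN forward pass per
# proposal is exactly what makes R-CNN slow in practice.
image = torch.rand(3, 480, 640)
boxes, scores = rcnn_detect(image, torch.randn(20, 4096), torch.zeros(20), n=20)
print(boxes.shape, scores.shape)
```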

Figure 2. An overview of SSD. SSD first takes an image as input, then extracts features by means of a base network (e.g., a truncated VGG-16 network without classification layers) and several additional feature layers to obtain multi-scale feature maps, subsequently obtains initial detection results through multiway classification and box regression using a set of convolutional filters, and finally applies Non-Maximum Suppression (NMS) to eliminate redundant results.
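A rough PyTorch sketch of the SSD prediction step: small 3 × 3 convolutional filters attached to several feature maps produce per-location class scores and box offsets, and NMS prunes the redundant boxes. The channel counts, number of default boxes per location, and the decoded boxes below are placeholders, not the exact SSD300 configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class SSDHead(nn.Module):
    """Small 3x3 convolutional filters applied to one feature map: for each of
    k default boxes per location they predict num_classes scores and 4 offsets."""
    def __init__(self, in_channels, num_classes, k=4):
        super().__init__()
        self.num_classes = num_classes
        self.cls = nn.Conv2d(in_channels, k * num_classes, 3, padding=1)
        self.reg = nn.Conv2d(in_channels, k * 4, 3, padding=1)

    def forward(self, fmap):
        n = fmap.shape[0]
        scores = self.cls(fmap).permute(0, 2, 3, 1).reshape(n, -1, self.num_classes)
        offsets = self.reg(fmap).permute(0, 2, 3, 1).reshape(n, -1, 4)
        return scores, offsets

# Multi-scale feature maps from the base network and the additional feature
# layers (channel counts and spatial sizes here are illustrative).
shapes = [(512, 38), (1024, 19), (512, 10), (256, 5)]
feature_maps = [torch.randn(1, c, s, s) for c, s in shapes]
heads = [SSDHead(c, num_classes=21) for c, _ in shapes]

scores = torch.cat([h(f)[0] for h, f in zip(heads, feature_maps)], dim=1)   # (1, N, 21)

# After decoding the offsets against the default boxes (omitted here), NMS
# removes redundant detections of the same object:
boxes = torch.rand(scores.shape[1], 4) * 300
boxes[:, 2:] = boxes[:, 2:] + boxes[:, :2] + 1.0     # make (x1, y1, x2, y2) valid
keep = nms(boxes, scores[0].softmax(dim=-1)[:, 1], iou_threshold=0.45)
print(scores.shape, keep.shape)
```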

Figure 3. Macro architectural view of SqueezeNet v1.1 (inspired by Figure 2). Processing begins with a convolutional layer (conv1), followed by 8 fire modules (structures proposed in SqueezeNet, which have fewer parameters than normal convolutional layers without sacrificing competitive accuracy), and ends with a convolutional layer (conv10) and a softmax classifier. SqueezeNet takes as input a 224 × 224 pixel image with 3 colour channels (R, G and B).

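A minimal PyTorch sketch of the fire module described above: a 1 × 1 "squeeze" convolution reduces the channel count, then parallel 1 × 1 and 3 × 3 "expand" convolutions are applied and their outputs concatenated. The channel counts follow fire2 of SqueezeNet v1.1; the spatial size of the test input is only illustrative:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: squeeze with 1x1 convs, then expand with parallel 1x1 and
    3x3 convs. Applying few 3x3 filters to an already-squeezed input is what
    keeps the parameter count well below that of a plain convolutional layer."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Channel sizes of fire2 in SqueezeNet v1.1; 55 x 55 is roughly the feature-map
# size reaching fire2 for a 224 x 224 input.
x = torch.randn(1, 64, 55, 55)
print(Fire(64, 16, 64, 64)(x).shape)   # torch.Size([1, 128, 55, 55])
```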

Figure 4. The proposed lightweight SSD architecture (inspired by Figure 2). This architecture follows a design similar to that of the original SSD. The main differences are that it takes a 224 × 224 pixel image as input and then uses a truncated SqueezeNet (rather than VGG-16) and a series of additional layers (at lower depths than the original) to extract features from the image. The features it uses for detection are selected from 5 layers: fire9 (the last fire module in the SqueezeNet), Ex1_2, Ex2_2, Ex3_2 (three convolutional layers) and GAP (a global average pooling layer).
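A rough PyTorch sketch of this feature extractor, assuming the torchvision SqueezeNet v1.1 feature stack (which ends at fire9) as the truncated base and placeholder channel widths and strides for the additional Ex blocks; the paper's exact layer configuration is not reproduced here:

```python
import torch
import torch.nn as nn
from torchvision.models import squeezenet1_1

class LightweightSSDBackbone(nn.Module):
    """Truncated SqueezeNet plus extra convolutional blocks and a global
    average pooling layer; detection features are taken from 5 points."""
    def __init__(self):
        super().__init__()
        self.base = squeezenet1_1(weights=None).features   # ends at fire9 (512 channels)

        def block(in_ch, out_ch):                          # Ex*_1 (1x1) + Ex*_2 (3x3, stride 2)
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch // 2, 1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))

        self.ex1 = block(512, 256)             # produces the Ex1_2 feature map
        self.ex2 = block(256, 256)             # produces the Ex2_2 feature map
        self.ex3 = block(256, 128)             # produces the Ex3_2 feature map
        self.gap = nn.AdaptiveAvgPool2d(1)     # GAP

    def forward(self, x):
        fire9 = self.base(x)                   # output of the last fire module
        ex1 = self.ex1(fire9)
        ex2 = self.ex2(ex1)
        ex3 = self.ex3(ex2)
        gap = self.gap(ex3)
        return [fire9, ex1, ex2, ex3, gap]     # the 5 feature maps used for detection

maps = LightweightSSDBackbone()(torch.randn(1, 3, 224, 224))
print([tuple(m.shape[-2:]) for m in maps])     # spatial sizes shrink from fire9 down to 1x1
```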

Figure 5. The 2D screen coordinate system, the 3D real world coordinate system and the relationships between them. A detected bounding box is described in the 2D screen coordinate system. The 3D real world coordinate system is established on the basis of the view frustum created by the visual sensor, with the origin at the centre of the visual sensor. The X and Y axes are parallel to the screen. The Z axis, which corresponds to the negative direction of the visual sensor’s orientation, is perpendicular to the screen. The 2D coordinates of the detected bounding box can be converted into target bounding box coordinates on the target plane in the 3D real world coordinate system for virtual object registration.

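As a worked illustration of this 2D-to-3D conversion (not the paper's exact formulation), the sketch below back-projects a screen point onto a plane at a given distance in front of the visual sensor, assuming a pinhole model with a known horizontal field of view; the field-of-view value and plane distance in the example are assumptions:

```python
import math

def screen_to_plane(u, v, width, height, fov_h_deg, plane_dist):
    """Convert a 2D screen point (u, v) in pixels into 3D coordinates on a plane
    plane_dist in front of the visual sensor, in a sensor-centred frame whose
    X and Y axes are parallel to the screen and whose Z axis points opposite to
    the viewing direction (as in Figure 5)."""
    f = (width / 2) / math.tan(math.radians(fov_h_deg) / 2)   # focal length in pixels
    x = (u - width / 2) / f * plane_dist                      # right of the optical axis
    y = (height / 2 - v) / f * plane_dist                     # up (screen v grows downward)
    z = -plane_dist                                           # viewing direction is -Z
    return (x, y, z)

# Example: map the two corners of a detected bounding box (in pixels) onto a
# target plane 10 m away, for a 1920 x 1080 view with a 60 degree horizontal FOV.
top_left = screen_to_plane(600, 300, 1920, 1080, 60, 10.0)
bottom_right = screen_to_plane(900, 700, 1920, 1080, 60, 10.0)
print(top_left, bottom_right)
```

The resulting 3D corner coordinates are what the virtual object is registered against on the target plane.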


Reposted from blog.csdn.net/chengyq116/article/details/93350083