"Paper Reading 21" Equivariant Multi-View Networks

1. Paper

  • Research field: Computer Vision | Implementing equivariance in multi-view data processing
  • Title: Equivariant Multi-View Networks
  • ICCV 2019

  • Paper link
  • Video link

2. Brief description of the paper

In computer vision, a model's outputs respond to changes in the data (e.g., point clouds, images) observed under different viewing angles. To let models adapt to such changes, rather than training only on data from a specific viewpoint, the authors propose equivariant multi-view networks: models that process multi-view data jointly and maintain the data's equivariance through shared weights or other mechanisms.

3. Detailed description of the paper

Equivariant multi-view network

  • Abstract

Deep neural networks pre-trained on natural images are commonly used to process multiple views of the input independently, with view permutation invariance achieved by a single round of pooling over all views. We argue that this operation discards important information and leads to subpar global descriptors. In this paper, we propose a group convolutional approach to multi-view aggregation, i.e., convolution over a discrete subgroup of the rotation group, enabling joint reasoning over all views in an equivariant (rather than invariant) manner, up to the very last layer. We further develop this idea to operate on smaller discrete homogeneous spaces of the rotation group, using a polar view representation that preserves equivariance with only a fraction of the number of input views. We set the new state of the art on several large-scale 3D shape retrieval tasks and demonstrate an additional application to panoramic scene classification.

  • Previous work: deep neural networks pre-trained on natural images process multiple views of the input independently, achieving view permutation invariance by a single round of pooling over all views.

  • This work: we propose a group convolutional approach to multi-view aggregation, i.e., convolution over a discrete subgroup of the rotation group, enabling joint reasoning over all views in an equivariant (rather than invariant) manner, up to the very last layer.

View permutation invariance means that when processing three-dimensional data (such as point clouds and 3D models), the model is invariant to changes in viewing angle and to the order in which the views are arranged. In point cloud processing, since the order and arrangement of points may change under different viewing angles, maintaining invariance to these permutations is crucial for robust feature extraction and analysis.

View permutation invariance matters for many point cloud tasks, such as classification, segmentation, and object detection. Enforcing it prevents the model from learning features tied to one specific viewpoint, so it generalizes better to point cloud data seen from different viewpoints.

The following are some methods and ideas that can help achieve view permutation invariance (a small sketch of idea 4 follows the list):

1. Capture the characteristics of point clouds at different viewing angles and maintain equivariance on the spherical surface.

2. Design a rotation-invariant feature extraction method to ensure that point cloud features remain consistent under different viewing angles.

3. During training, apply data augmentation (e.g., random rotations and viewpoint changes) to help the model learn features from different perspectives.

4. Fusion of features extracted from different perspectives to generate a more comprehensive feature representation.

5. **Point cloud alignment**: Align the point cloud before training to make the point correspondences under different viewing angles more consistent.
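
As a concrete illustration of idea 4, and of view permutation invariance in general, here is a minimal numpy sketch; the number of views and the descriptor dimensionality are made up, and the element-wise max could equally be a mean or a sum (any symmetric function works).

```python
import numpy as np

rng = np.random.default_rng(0)
# 12 per-view descriptors of one shape, 128-D each (hypothetical sizes).
views = rng.standard_normal((12, 128))

def aggregate(view_descriptors):
    """Fuse per-view features with an order-independent (symmetric)
    function: the element-wise max over the view axis."""
    return view_descriptors.max(axis=0)

# Shuffling the views leaves the fused descriptor unchanged.
shuffled = views[rng.permutation(len(views))]
print(np.allclose(aggregate(views), aggregate(shuffled)))  # True
```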

Multi-view aggregation: integrating information from multiple views (or multiple inputs).

Joint reasoning over all views: the model reasons across all views jointly, considering information from the different views together and retaining this multi-view information while processing the data.

A discrete subgroup of the rotation group is a subset of the rotation group containing a discrete set of rotations that is itself closed under composition. A common example in 3D space is the set of rotations by fixed angle increments about the Z axis: only rotations by certain specific angles about that one axis are considered, not rotations about other axes. The subgroup is discrete because it contains only particular rotation angles rather than all possible continuous rotations.

The rotation group is a continuous, infinite group that contains all possible continuous rotation operations. However, when we consider computational or discrete problems, sometimes a subset of the rotation group is used to simplify the problem or perform calculations.

The rotation group SO(3) consists of all rotations of three-dimensional space that keep the origin fixed. These operations are represented by 3x3 rotation matrices and include rotation about any axis. Elements of SO(3) are orthogonal 3x3 matrices with determinant equal to 1 (the "special" in special orthogonal).
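
A small numpy sketch of these definitions, assuming rotations are represented as 3x3 matrices: it builds the discrete subgroup of 90-degree rotations about the Z axis and checks the SO(3) conditions (orthogonality, determinant +1) as well as closure under composition.

```python
import numpy as np

def rot_z(theta):
    """3x3 rotation matrix about the Z axis (an element of SO(3))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# The cyclic group C4: rotations by 0, 90, 180, 270 degrees about Z,
# a discrete subgroup of the continuous group SO(3).
group = [rot_z(k * np.pi / 2) for k in range(4)]

for R in group:
    assert np.allclose(R @ R.T, np.eye(3))    # orthogonal
    assert np.isclose(np.linalg.det(R), 1.0)  # determinant +1

# Closure: the product of any two elements is again in the subgroup.
print(all(any(np.allclose(A @ B, C) for C in group)
          for A in group for B in group))  # True
```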

 

  • Introduction

With the proliferation of large-scale 3D object datasets [39, 3] and whole-scene datasets [2, 8], deep learning models can be trained to generate global descriptors for classification and retrieval tasks.

  • Train deep learning models to generate global descriptors that can be used for classification and retrieval tasks.

The first challenge is how to represent the input. Despite numerous attempts at volumetric [39, 24], point cloud [27, 32], and mesh-based [23, 26] representations, using multiple views of a 3D input moves the problem to the 2D domain, where all the recent image-based deep learning breakthroughs (e.g., [15]) can be applied directly, driving state-of-the-art performance [33, 20].

A multi-view (MV) based approach requires some form of view pooling, which can be

(1) Pixel-wise pooling on some intermediate convolutional layers [33],

(2) Pooling on the final 1D view descriptor [34],

(3) Combining the final logits [20], which can be regarded as independent voting. These operations are all invariant to permutations of the views (sketched below).
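
A schematic numpy sketch of the three pooling placements, with random matrices standing in for trained CNN and classifier weights; all names and shapes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_views, C, H, W, n_classes = 12, 8, 7, 7, 40

feat = rng.standard_normal((n_views, C, H, W))  # per-view conv features
W_fc = rng.standard_normal((C * H * W, 256))    # descriptor head
W_cls = rng.standard_normal((256, n_classes))   # classifier head

# (1) Pixel-wise max pooling across views at an intermediate layer [33].
desc1 = feat.max(axis=0).reshape(-1) @ W_fc      # (256,)

# (2) Pooling over the final 1D view descriptors [34].
view_desc = feat.reshape(n_views, -1) @ W_fc     # (n_views, 256)
desc2 = view_desc.max(axis=0)                    # (256,)

# (3) Averaging the per-view logits: independent voting [20].
votes = (view_desc @ W_cls).mean(axis=0)         # (n_classes,)

# In all three cases, permuting the views leaves the result unchanged --
# invariance is obtained, but joint reasoning across views is lost.
```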

 

Our main point is that traditional view pooling, performed before any joint processing of the view set, inevitably discards useful features, resulting in subpar descriptors. To address this, we first observe that each view can be associated with an element of the rotation group SO(3), so the natural way to combine multiple views is as a function on the rotation group.

  • Traditional view pooling, performed before any joint processing of the view set, inevitably discards useful features, resulting in subpar descriptors.
  • Each view can be associated with an element of the rotation group SO(3), so the natural way to combine multiple views is as a function on the rotation group.

We adopt a conventional CNN to obtain the view descriptors that make up this function. We then design a group convolutional network (G-CNN, inspired by [5]) to learn representations that are equivariant to transformations from the group. By pooling only after the last G-CNN layer, we obtain invariant descriptors useful for classification and retrieval. Our G-CNN uses filters with local support on the group, so as the number of layers increases and the receptive field expands, more complex hierarchical descriptors can be learned.

We exploit the finiteness of the set of views and consider finite rotation groups such as the icosahedral group, unlike [6, 10], which operate on continuous groups. To reduce the computational cost of processing one view per group element, we show that by considering views in canonical coordinates with respect to the group of in-plane rotations and dilations (log-polar coordinates), we can significantly reduce the number of views: the views then live on a homogeneous space (H-space) of the rotation group and can be lifted to the group via correlation while preserving equivariance.

We focus on 3D shapes, but our model is applicable to any task where multiple views can represent the input, as shown in experiments on panoramic scenes.

Equivariant features are features that transform in a predictable, corresponding way when the input data undergoes a transformation: if the input is transformed by a group element g, the features are transformed by a matching operation. In computer vision and deep learning, equivariance is an important property, especially when dealing with data that has transformation symmetry, such as images, point clouds, and 3D models.

Equivariant features are useful because they preserve how the input transforms, capturing the key structure of the data and thereby improving the model's generalization and performance. For example, for 3D point cloud data, equivariant features change correspondingly when the data is rotated or translated, so the model can better adapt to different viewpoints and transformations.
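
The defining property, transform the input and the features transform correspondingly, is easiest to see in the simplest case: ordinary convolution is equivariant to translations. A minimal numpy/scipy sketch (circular padding is used so the identity holds exactly at the borders):

```python
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))  # input "image"
w = rng.standard_normal((3, 3))    # convolution filter

# A convolution layer with circular (wrap-around) padding.
conv = lambda a: correlate(a, w, mode="wrap")

# Shifting the input and then convolving...
a = conv(np.roll(x, (2, 3), axis=(0, 1)))
# ...equals convolving and then shifting the feature map.
b = np.roll(conv(x), (2, 3), axis=(0, 1))
print(np.allclose(a, b))  # True
```

Group convolutions, discussed below, generalize exactly this property from translations to other groups such as rotations.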

In point cloud processing, the realization of equivariant features involves some specialized methods and technologies, such as:

1. **Rotation equivariance**: By designing the neural network architecture, when the network input data is rotated, the features will also rotate accordingly, thereby achieving rotation equivariance.

2. **Spherical CNNs**: networks for spherical data (such as spherical projections of point clouds) that maintain rotation equivariance on the sphere, extracting meaningful features from different viewpoints of the point cloud.

3. **Transformation matrix-based operations**: Use transformation matrices to define transformations of point clouds, and then incorporate these transformation operations in neural networks to capture equivariant features.

4. **Group CNNs**: Design the network structure to be equivariant under specific group (such as rotation group) transformations, so that it can handle transformation symmetry data.

Implementing equivariant features often requires in-depth mathematical and geometric knowledge to ensure that the model captures and represents features correctly as the data is transformed. This is especially important when dealing with irregular data such as point clouds, which do not have a fixed structure like images and require special processing methods to achieve equivariance.

Group convolution is an operation in convolutional neural networks (CNNs) used to process data with a certain symmetry or structure. It preserves the specific symmetry of the input data by construction, allowing it to capture the characteristics of the data more effectively.

In a group convolution (the sense used in this paper, as in G-CNNs), feature maps are treated as functions on a transformation group. Each filter is transformed by every element of the group and correlated with the input, so the output is again a function on the group; transforming the input by a group element simply shifts the output by that element, which is exactly the equivariance property.

This should not be confused with the "grouped convolution" of image architectures, where the convolution kernels are divided into groups and each kernel is convolved only with the input channels of its own group; for example, the red, green, and blue channels of an RGB image could be placed in separate groups and convolved independently. That channel-splitting trick mainly reduces parameters and computation and has no built-in connection to rotational symmetry.

In point cloud and multi-view processing, group convolution applies when the data carries a group structure; for example, the views of a 3D shape can be indexed by discrete rotations, and convolution can then be performed over that rotation group to maintain equivariance.

The advantages of group convolution include:

1. **Parameter sharing across transformations**: a single filter is reused for every group element, so the network does not need to learn separate rotated copies of the same feature, reducing the number of parameters to be learned.

2. **Maintaining specific symmetry**: group convolution captures the specific symmetry or structure of the input data by construction, improving the model's performance under those transformations.

3. **Reducing overfitting**: the built-in weight sharing constrains the hypothesis space, helping to reduce the risk of overfitting.

Note that group convolution is suited to data with a specific symmetry or structure, not to every situation. When designing a network architecture, decide whether to use it based on the characteristics of the data and the requirements of the task. A minimal sketch follows.
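
To make this concrete, here is a minimal numpy sketch of the lifting layer of a discrete group convolution, using the small cyclic group C4 of 90-degree image rotations rather than the paper's icosahedral group; it is illustrative only, not the paper's implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def c4_lift(image, filt):
    """Lifting layer of a discrete group convolution for the group C4:
    the filter is transformed by each group element (rotation by 0, 90,
    180, 270 degrees) and correlated with the image, producing one
    feature map per group element, i.e., a function on C4."""
    return np.stack([correlate2d(image, np.rot90(filt, k), mode="same")
                     for k in range(4)])

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
filt = rng.standard_normal((5, 5))

out = c4_lift(image, filt)                # shape (4, 32, 32)
out_rot = c4_lift(np.rot90(image), filt)  # rotate the input 90 degrees

# Equivariance: rotating the input rotates each feature map AND cyclically
# shifts the group axis -- the output transforms predictably rather than
# staying fixed (invariance) or scrambling.
expected = np.stack([np.rot90(out[(k - 1) % 4]) for k in range(4)])
print(np.allclose(out_rot, expected))  # True

# Pooling over group and spatial axes only at the very end yields a
# rotation-invariant descriptor, mirroring pooling after the last
# G-CNN layer.
print(np.isclose(out.max(), out_rot.max()))  # True
```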

 

Figure 1 illustrates our model. Our contributions are: 

  • We introduce a novel way to aggregate multiple views, whether they are "outside-in" views of a 3D shape or "inside-out" views of a panorama. Our model exploits the underlying group structure, yielding equivariant features that are functions on the rotation group.
  • We introduce a method that reduces the number of views while maintaining equivariance, by transforming to canonical coordinates with respect to in-plane rotations, followed by convolutions on a homogeneous space.
  • We explore finite rotation groups and their homogeneous spaces, and propose a discrete G-CNN model on the largest group to date, the icosahedral group. We further explore the concept of filter localization for this group.
  • We achieve state-of-the-art performance on multiple shape retrieval benchmarks, both in canonical pose and under rotation perturbations, and show an application to panoramic scene classification.

 

Figure 1: Our equivariant multi-view network aggregates multiple views as a function on a rotation group and processes them via group convolutions. This guarantees equivariance to three-dimensional rotations and allows joint reasoning over all views, resulting in superior shape descriptors. Vector-valued functions on the icosahedral group are shown on the pentakis dodecahedron, and the corresponding functions on homogeneous spaces (H-spaces) on the dodecahedron and icosahedron. Each view is first processed by a CNN, and the resulting descriptor is associated with a group (or H-space) element. When the views are identified with an H-space, the first operation lifts the features to the group via correlation. Once we have an initial representation on the group, we can apply the group CNN.

  • Related work

3D shape analysis

The performance of 3D shape analysis depends heavily on the input representation . The main representations are volumes, point clouds and multi-views.

Early examples of volumetric methods are [3], which introduced the ModelNet dataset and trained a 3D shape classifier using a deep belief network on a voxel representation, and [24], which proposed a standard architecture of 3D convolutional layers followed by fully connected layers.

Su et al. [33] realized that the power of image-based CNNs can be transferred to 3D tasks by rendering multiple views of the 3D input. They showed that conventional CNNs can outperform volumetric methods even when using only a single view of the input, and that multi-view (MV) models further improve classification accuracy.

Qi et al. [28] studied volumetric and multi-view methods and proposed improvements to both; Kanezaki et al. [20] introduced an MV method that achieves state-of-the-art classification performance by jointly predicting categories and poses, without explicit pose supervision.

GVCNN [12] attempts to learn how to combine different view descriptors to obtain a view-group shape representation; they refer to arbitrary combinations of features as "groups". This differs from our use of the term "group", which is defined algebraically.

Point cloud based methods [27] achieve performance intermediate between volumetric and multi-view methods, but are more computationally efficient. Although meshes are arguably the most natural representation and are widely used in computer graphics, learning models that operate directly on meshes have had only limited success [23, 26].

To better compare 3D shape descriptors, we focus on retrieval performance. Recent methods have shown significant improvements in retrieval: You et al. [41] combined point cloud and MV representations; Yavartanoo et al. [40] introduced a multi-view stereographic projection; and Han et al. [14] introduced a recurrent MV method.

We also consider rotated ModelNet and the more challenging SHREC'17 retrieval challenge [29], which involves rotated shapes. The presence of arbitrary rotations motivates the use of equivariant representations.

Equivariant representations

To handle 3D shapes in arbitrary orientations, many workarounds have been introduced. Typical examples include rotation augmentation at training time and/or voting at test time [28], and learning an initial rotation to a canonical pose [27]. The view pooling in [33] is invariant to permutations of the input view set.

The principled way to handle rotations is to use a representation designed to be equivariant. There are three main ways of building equivariance into CNNs.

The first is to constrain the filter structure, as in methods based on Lie group generators [30, 17]. Worrall et al. [38] used circular harmonics to build both translation and 2D rotation equivariance into CNNs. Similarly, Thomas et al. [35] introduced tensor field networks to maintain translation and rotation equivariance for 3D point clouds.

The second is through a change of coordinates: [11, 18] apply a log-polar transformation to the input, converting rotation and scaling equivariance about a single point into translation equivariance.
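
A minimal numpy/scipy sketch of this coordinate change (an illustration of the idea, not the implementation of [11, 18]): resampling onto log-polar coordinates turns rotation about the image center into a circular shift along the angular axis, which ordinary convolution then handles equivariantly.

```python
import numpy as np
from scipy.ndimage import map_coordinates, rotate

def log_polar(image, n_r=64, n_theta=64):
    """Resample an image onto log-polar canonical coordinates. Rotation
    about the center becomes a circular shift along the theta axis, and
    scaling about the center becomes a shift along the log-r axis."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radii = np.exp(np.linspace(0.0, np.log(min(cy, cx)), n_r))
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(radii, theta, indexing="ij")
    return map_coordinates(image, [cy + rr * np.sin(tt),
                                   cx + rr * np.cos(tt)], order=1)

# Smooth test image: an off-center Gaussian blob.
yy, xx = np.mgrid[0:65, 0:65]
image = np.exp(-((yy - 20.0) ** 2 + (xx - 40.0) ** 2) / 30.0)

lp = log_polar(image)
lp90 = log_polar(rotate(image, 90, reshape=False, order=1))

# A 90-degree rotation appears as a circular shift by a quarter of the
# 64 angular samples (the sign depends on axis conventions).
errs = [np.mean((np.roll(lp, s, axis=1) - lp90) ** 2) for s in range(64)]
print(np.argmin(errs) % 32)  # 16
```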

The third is to use filter orbits to achieve equivariance. Cohen and Welling proposed group convolutions (G-CNNs) [5] on the group of 90-degree rotations of the square, later extended to hexagonal rotations [19]. Worrall and Brostow [37] proposed CubeNet, using the Klein four-group on 3D voxelized data. Winkels et al. [36] implemented 3D group convolutions on the octahedral symmetry group for volumetric CT images. Cohen et al. [7] recently considered functions on the icosahedron, but their convolutions are over cyclic groups rather than over the icosahedral group as ours are. Esteves et al. [10] and Cohen et al. [6] focus on the infinite group SO(3) and use spherical harmonic transforms to implement spherical convolution or correlation exactly. The main limitation of these methods is that the input spherical representation cannot capture the complexity of object shapes; they are also less efficient and face bandwidth challenges.

  • Preliminaries

We seek to exploit symmetries in the data. Symmetry is an operation that preserves some structure of an object. If the object is a discrete collection with no additional structure, each operation can be viewed as a permutation of its elements. 

The term group is used in the classical algebraic sense: a set with an operation satisfying the closure, associativity, identity, and invertibility properties. Transformation groups like permutations are "the missing link between abstract groups and the concept of symmetry" [25].

We refer to a view as an image captured by a camera in a given orientation. This differs from the viewpoint, which refers to the direction of the optical axis: outside-in for a moving camera pointed at a fixed object, or inside-out for a fixed camera pointing in different directions. Multiple views can be taken from the same viewpoint; they are related by in-plane rotation.

Outside-in: a moving camera pointed at a fixed object (e.g., views around a 3D shape)

Inside-out: a fixed camera pointing in different directions (e.g., a panorama)

Equivariance

Designing equivariant representations is an effective way to exploit symmetries. Consider a set X and a transformation group G acting on it. A map Φ is equivariant to G if transforming its input by any g in G produces a corresponding transformation g' of its output, i.e., Φ(g · x) = g' · Φ(x); invariance is the special case where g' is the identity.

Origin: blog.csdn.net/peng_258/article/details/132591021