2. Siamese Neural Networks for One-shot Image Recognition - Detailed Paper Reading, Part II (Model Structure and Training Settings)

⭐Original link: Siamese Neural Networks for One-shot Image Recognition (cmu.edu)

This part mainly covers the mathematical structure of the model and how it is trained; it is easiest to sort out and understand alongside the code (the code will be explained in detail in Part IV).

*Note: Due to formatting issues, some symbols in the text may be displayed incorrectly (e.g., decimal points shown as boxes); they can usually be inferred from their position in the paragraph.

*Black text - translation of the original

*Red text - points where there may be a problem

*Blue text - advantages

*Green text - subjective analysis (not every detail is analyzed; these notes only supplement the content of this article)

3. Deep Siamese Networks for Image Verification

In the early 1990s, Bromley and LeCun first introduced siamese nets to solve signature verification, posed as an image matching problem.

A siamese neural network consists of twin networks that accept different inputs but are connected by an energy function on top. This function calculates some metric between the highest-level feature representations on each side (Figure 3).

Figure 3. A simple 2-hidden-layer siamese network for binary classification with a logistic prediction p. The structure of the network is replicated across the top and bottom sections to form twin networks with shared weight matrices at each layer.

The parameters between the dual networks are tied. Weight tying ensures that two extremely similar images cannot be mapped by their respective networks to very different locations in feature space, since each network computes the same function.

*In siamese nets the parameters of the two networks are identical, ensuring that the features extracted from the two images are computed consistently.
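
As a minimal sketch of weight tying (not the paper's architecture; the layer sizes below are arbitrary placeholders), applying one and the same PyTorch module to both inputs is enough to share every parameter between the twins:

```python
import torch
import torch.nn as nn

# Toy shared encoder: because the SAME module processes both inputs, the two
# "twins" literally share one set of weight tensors, so they compute the same
# function and map similar images to nearby points in feature space.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(105 * 105, 256),   # Omniglot images are 105x105
    nn.ReLU(),
    nn.Linear(256, 128),
)

x1 = torch.rand(4, 1, 105, 105)    # batch of 4 "first" images
x2 = torch.rand(4, 1, 105, 105)    # batch of 4 "second" images
h1, h2 = encoder(x1), encoder(x2)  # same parameters used for both sides
```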

Furthermore, the network is symmetric: whenever we present two distinct images to the twin networks, the top conjoining layer computes the same metric as it would if the same two images were presented to the opposite twins.

In the article by LeCun et al., the authors used a contrastive energy function containing dual terms to reduce the energy of like pairs and increase the energy of unlike pairs.

However, in this paper we use the weighted L1 distance between the twin feature vectors, combined with a sigmoid activation, which maps the output onto the interval [0, 1]. Thus, a cross-entropy objective is a natural choice for training the network. Note that in the LeCun et al. paper, they directly learned the similarity metric, which was implicitly defined by the energy loss, whereas we fix the metric as specified above, following the approach in Facebook's DeepFace paper.
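
A minimal sketch of this fixed metric: a learned, component-wise weighted L1 distance squashed by a sigmoid, which pairs naturally with binary cross-entropy. The feature dimension of 4096 matches the fully-connected layer mentioned later; the linear layer's weights play the role of the importance weights.

```python
import torch
import torch.nn as nn

feature_dim = 4096                      # matches the FC layer described later
alpha = nn.Linear(feature_dim, 1)       # learns the component-wise weights (plus a bias)

def similarity(h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
    """p = sigmoid( sum_j alpha_j * |h1_j - h2_j| + bias ), so p lies in [0, 1]."""
    return torch.sigmoid(alpha(torch.abs(h1 - h2))).squeeze(-1)

# With p in [0, 1], cross-entropy is the natural training objective:
h1, h2 = torch.rand(8, feature_dim), torch.rand(8, feature_dim)
y = torch.randint(0, 2, (8,)).float()   # 1 = same class, 0 = different
loss = nn.functional.binary_cross_entropy(similarity(h1, h2), y)
```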

Our best-performing models use multiple convolutional layers before fully-connected layers and top-level energy functions. Convolutional neural networks have achieved excellent results in many large-scale computer vision applications, especially in image recognition tasks.

Several factors make convolutional networks particularly attractive. Local connectivity can greatly reduce the number of parameters in the model, which essentially provides some form of built-in regularization, although convolutional layers are computationally more expensive than standard nonlinearities.

Furthermore, the convolution operations used in these networks have a direct filtering interpretation, where each feature map is convolved against the input features to identify patterns as groupings of pixels.

Therefore, the output of each convolutional layer corresponds to important spatial features in the original input space and provides some robustness to simple transformations. Finally, very fast CUDA libraries are now available for building large convolutional networks without requiring unacceptable amounts of training time.

We now detail the structure of siamese nets and the details of the learning algorithm used in our experiments.

3.1. Model

Our standard model is an $L$-layer siamese convolutional neural network with $N_l$ units in each layer, where $h_{1,l}$ denotes the hidden vector in layer $l$ for the first twin and $h_{2,l}$ the hidden vector for the second twin. We use exclusively rectified linear (ReLU) units in the first $L-2$ layers and sigmoidal units in the remaining layers.

The model consists of a sequence of convolutional layers, each of which uses a single channel with filters of varying size and a fixed stride of 1. The number of convolutional filters is specified as a multiple of 16 to optimize performance. The network applies a ReLU activation function to the output feature maps, optionally followed by max-pooling with a filter size and stride of 2. The $k$-th filter map in each layer therefore takes the following form:

$$a_{1,m}^{(k)} = \text{max-pool}\big(\max(0,\, \mathbf{W}_{l-1,l}^{(k)} \star \mathbf{h}_{1,(l-1)} + \mathbf{b}_l),\, 2\big)$$

$$a_{2,m}^{(k)} = \text{max-pool}\big(\max(0,\, \mathbf{W}_{l-1,l}^{(k)} \star \mathbf{h}_{2,(l-1)} + \mathbf{b}_l),\, 2\big)$$

where $\mathbf{W}_{l-1,l}$ is a three-dimensional tensor representing the feature maps for layer $l$, and $\star$ denotes the valid convolution operation, which corresponds to returning only those output units that result from complete overlap between each convolutional filter and the input feature maps.
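
A single block of this form can be sketched directly in PyTorch; the channel counts below are placeholders, and padding is 0 so the convolution is "valid" as described:

```python
import torch
import torch.nn as nn

# One convolutional block per the equation above: valid convolution (no
# padding), stride 1, ReLU, then optional 2x2 max-pooling with stride 2.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=64, kernel_size=10, stride=1, padding=0),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

h_prev = torch.rand(1, 1, 105, 105)   # previous layer's feature map (batch of 1)
a_k = block(h_prev)                   # shape (1, 64, 48, 48) for this example
```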

The units in the final convolutional layer are flattened into a single vector. This convolutional layer is followed by a fully-connected layer, and then one more layer that computes the induced distance metric between the two siamese twins, which is fed to a single sigmoidal output unit. More precisely, the prediction vector is given as $p = \sigma\big(\sum_j \alpha_j\, |h_{1,L-1}^{(j)} - h_{2,L-1}^{(j)}|\big)$, where $\sigma$ is the sigmoidal activation function. This final layer induces a metric on the learned feature space of the $(L-1)$-th hidden layer and scores the similarity between the two feature vectors. The $\alpha_j$ are additional parameters learned by the model during training, weighting the importance of the component-wise distance. This defines a final $L$-th fully-connected layer for the network, which joins the two siamese twins.

We describe one example (Figure 4), showing the largest version of the model we considered. This network also gave the best results on the verification task.

Figure 4. Best convolutional architecture selected for the verification task. The siamese twin is not depicted, but joins immediately after the 4096-unit fully-connected layer, where the L1 component-wise distance between the vectors is computed.
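
Putting the pieces together, here is a sketch of the verification network described above. The filter counts and sizes follow the configuration commonly cited for Figure 4 (64@10x10, 128@7x7, 128@4x4, 256@4x4); if they differ from your copy of the paper, treat them as illustrative placeholders.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Convolutional twin + 4096-unit FC layer + weighted L1 distance + sigmoid."""

    def __init__(self):
        super().__init__()
        self.twin = nn.Sequential(
            nn.Conv2d(1, 64, 10), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 4), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.Sigmoid(),  # sigmoidal FC units
        )
        self.alpha = nn.Linear(4096, 1)  # weights on the component-wise L1 distance

    def forward(self, x1, x2):
        h1, h2 = self.twin(x1), self.twin(x2)            # tied weights
        return torch.sigmoid(self.alpha(torch.abs(h1 - h2))).squeeze(-1)

model = SiameseNet()
p = model(torch.rand(2, 1, 105, 105), torch.rand(2, 1, 105, 105))  # p in [0, 1]
```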

3.2. Learning

Loss function

M represents minibatch size, where i indexes the i-th minibatch.

Now let $y(x_1^{(i)}, x_2^{(i)})$ be a length-$M$ vector containing the labels for the minibatch, where we assume $y(x_1^{(i)}, x_2^{(i)}) = 1$ whenever $x_1$ and $x_2$ are from the same character class and $y(x_1^{(i)}, x_2^{(i)}) = 0$ otherwise. We impose a regularized cross-entropy objective on the binary classifier of the following form:

$$\mathcal{L}(x_1^{(i)}, x_2^{(i)}) = y(x_1^{(i)}, x_2^{(i)}) \log p(x_1^{(i)}, x_2^{(i)}) + \big(1 - y(x_1^{(i)}, x_2^{(i)})\big) \log\big(1 - p(x_1^{(i)}, x_2^{(i)})\big) + \boldsymbol{\lambda}^\top |\mathbf{w}|^2$$

*This is a common way of setting up a loss function in deep learning. Each training example is a pair of two inputs, and the label indicates whether they belong to the same class: 1 for the same class, 0 for different classes.

*You need to understand how the cross-entropy loss is computed and how the regularization term is added.
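
A sketch of this objective in code: standard implementations minimize the negative log-likelihood, so the cross-entropy term appears via binary_cross_entropy, with an L2 penalty added on the weights. A single global coefficient lambda_l2 is used here for brevity (and its value is a placeholder), whereas the paper defines the penalty layer-wise.

```python
import torch
import torch.nn as nn

def regularized_bce(p: torch.Tensor, y: torch.Tensor, model: nn.Module,
                    lambda_l2: float = 1e-4) -> torch.Tensor:
    """Binary cross-entropy between predictions p and pair labels y,
    plus an L2 penalty on the weight matrices (biases excluded)."""
    bce = nn.functional.binary_cross_entropy(p, y)
    l2 = sum((w ** 2).sum() for name, w in model.named_parameters() if "weight" in name)
    return bce + lambda_l2 * l2
```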

Optimization

This objective is combined with the standard backpropagation algorithm, where the gradient is additive across the twin networks due to the tied weights. We fix the minibatch size at 128, with the learning rate $\eta_j$, momentum $\mu_j$, and $L_2$ regularization weight $\lambda_j$ defined layer-wise, so that our update rule at epoch $T$ is as follows:

$$w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) = w_{kj}^{(T)} + \Delta w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) + 2\lambda_j |w_{kj}|$$

$$\Delta w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) = -\eta_j \nabla w_{kj}^{(T)} + \mu_j \Delta w_{kj}^{(T-1)}$$

where $\nabla w_{kj}$ is the partial derivative with respect to the weight between the $j$-th neuron in some layer and the $k$-th neuron in the successive layer.

*This is the standard backpropagation algorithm, and the parameters are updated through training.
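
The update rule maps directly onto SGD with momentum and weight decay. PyTorch parameter groups make the layer-wise definition of $\eta_j$, $\mu_j$, and $\lambda_j$ straightforward; the numeric values below are placeholders, not the paper's tuned hyperparameters.

```python
import torch
import torch.nn as nn

# Two stand-in layers; in practice each layer of the siamese network gets its
# own parameter group with its own learning rate, momentum, and weight decay.
conv = nn.Conv2d(1, 64, 10)
fc = nn.Linear(4096, 1)
optimizer = torch.optim.SGD([
    {"params": conv.parameters(), "lr": 1e-3, "momentum": 0.5, "weight_decay": 1e-4},
    {"params": fc.parameters(),   "lr": 1e-3, "momentum": 0.5, "weight_decay": 1e-4},
])

# Per minibatch (size 128 in the paper):
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```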

Weight initialization

We initialize all network weights in the convolutional layers to a normal distribution with zero mean and standard deviation 10^-2. The bias is also initialized from a normal distribution, but with a mean of 0.5 and a standard deviation of 10^-2. In fully connected layers, biases are initialized in the same way as convolutional layers, but the weights are drawn from a wider normal distribution with a mean of zero and a standard deviation of 2*10^-1.

*Here are the initialization settings of network weights and biases, including weights and biases in convolutional layers and fully connected layers.
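
These initialization rules are easy to apply with a small helper; this is a sketch assuming a module built from Conv2d and Linear layers, such as the SiameseNet sketch above.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Conv weights ~ N(0, 1e-2); FC weights ~ N(0, 2e-1); all biases ~ N(0.5, 1e-2)."""
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=1e-2)
        if module.bias is not None:
            nn.init.normal_(module.bias, mean=0.5, std=1e-2)
    elif isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=2e-1)
        if module.bias is not None:
            nn.init.normal_(module.bias, mean=0.5, std=1e-2)

# model.apply(init_weights)   # recursively applies the rule to every submodule
```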

Learning schedule

Although we allow each layer to have its own learning rate, learning rates are decayed uniformly across the network by 1% per epoch, so that $\eta_j^{(T)} = 0.99\,\eta_j^{(T-1)}$.

We found that by annealing the learning rate, the network was able to converge to local minima more easily without getting stuck in the error surface. We fixed the momentum to start at 0.5 in every layer, increasing linearly each epoch until it reached the value $\mu_j$, the individual momentum term for the $j$-th layer.

*The key points here are understanding the learning-rate schedule and the purpose of adding momentum.
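
A sketch of this schedule with a single stand-in layer: the target momentum $\mu_j$ and the ramp length are assumptions, since only the starting value (0.5) and the 1% decay are stated here.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4096, 1)                      # stand-in for one layer of the network
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-3, momentum=0.5)
mu_target, ramp_epochs = 0.9, 100               # assumed mu_j and ramp length

def update_schedule(epoch: int) -> None:
    for group in optimizer.param_groups:
        group["lr"] *= 0.99                                 # 1% uniform decay per epoch
        frac = min(epoch / ramp_epochs, 1.0)
        group["momentum"] = 0.5 + frac * (mu_target - 0.5)  # linear momentum ramp

for epoch in range(200):
    # ... train one epoch ...
    update_schedule(epoch)
```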

We trained each network for a maximum of 200 epochs, but monitored the one-shot validation error on a set of 320 one-shot learning tasks generated randomly from the alphabets and drawers in the validation set.

When the validation error did not decrease for 20 epochs, we stopped and used the parameters of the model at the best epoch according to the one-shot validation error. If the validation error continued to decrease for the entire learning schedule, we saved the final state of the model produced by this procedure.

*Several practical training techniques are worth learning here.
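
The monitoring loop amounts to early stopping with a patience of 20 epochs on the one-shot validation error. In the sketch below, train_one_epoch and evaluate_one_shot are hypothetical helpers, not functions from the paper or any library.

```python
import copy

best_error, best_state, patience = float("inf"), None, 0
for epoch in range(200):                              # at most 200 epochs
    train_one_epoch(model, optimizer)                 # hypothetical helper
    error = evaluate_one_shot(model, n_tasks=320)     # hypothetical helper: 320 one-shot tasks
    if error < best_error:
        best_error = error
        best_state = copy.deepcopy(model.state_dict())
        patience = 0
    else:
        patience += 1
        if patience >= 20:                            # no improvement for 20 epochs
            break
model.load_state_dict(best_state)                     # best (or final) parameters
```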

Hyperparameter optimization

We used the beta version of Whetlab, a Bayesian optimization framework, to perform hyperparameter selection.

*Whetlab is used for hyperparameter selection.

1) For the learning-schedule and regularization hyperparameters, we set the layer-wise learning rate $\eta_j$, the layer-wise momentum $\mu_j$, and the layer-wise regularization penalty $\lambda_j$.

2) For the network hyperparameters, we let the size of the convolutional filters vary from 3x3 to 20x20, and the number of convolutional filters in each layer vary from 16 to 256 in multiples of 16.

3) Fully-connected layers ranged from 128 to 4096 units, also in multiples of 16.

We set the optimizer to maximize one-shot validation set accuracy. The score assigned to a single Whetlab iteration was the highest value of this metric found during any epoch.
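
The network part of the search space can be written down directly from the ranges above; since Whetlab is no longer available, a plain random-search sketch stands in for the Bayesian optimizer here.

```python
import random

search_space = {
    "conv_filter_size": list(range(3, 21)),         # 3x3 up to 20x20
    "conv_filters":     list(range(16, 257, 16)),   # 16 to 256, multiples of 16
    "fc_units":         list(range(128, 4097, 16)), # 128 to 4096, multiples of 16
}

def sample_config() -> dict:
    return {name: random.choice(values) for name, values in search_space.items()}

config = sample_config()
# Each trial would be scored by the highest one-shot validation accuracy
# reached at any epoch, matching the criterion described above.
```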

Affine distortions

*This step applies transformations to the original images (commonly rotation, flipping, etc.) to augment the data and improve recognition performance.

In addition, we augmented the training set with small affine distortions (Figure 5).

Figure 5. Examples of random affine distortions generated for a single character in the Omniglot dataset.

For each image pair $x_1, x_2$, we generate a pair of affine transformations $T_1, T_2$ to obtain $x_1' = T_1(x_1)$ and $x_2' = T_2(x_2)$, where $T_1, T_2$ are determined stochastically by a multidimensional uniform distribution.

For an arbitrary transform $T$, we have $T = (\theta, \rho_x, \rho_y, s_x, s_y, t_x, t_y)$, i.e. a rotation, shears, scales, and translations along each axis, with each component drawn from the uniform distribution above. Each component of the transformation is included with probability 0.5.
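
A sketch of such a stochastic affine augmentation using torchvision: rotation, shear, scale, and translation are each included with probability 0.5. The component ranges below are illustrative assumptions, not values taken from this text.

```python
import random
import torch
import torchvision.transforms.functional as TF

def random_affine(img: torch.Tensor) -> torch.Tensor:
    """Apply a random affine transform T; each component is used with probability 0.5."""
    angle = random.uniform(-10, 10) if random.random() < 0.5 else 0.0          # rotation (deg)
    shear = [random.uniform(-15, 15), random.uniform(-15, 15)] \
        if random.random() < 0.5 else [0.0, 0.0]                               # shear
    scale = random.uniform(0.8, 1.2) if random.random() < 0.5 else 1.0         # scale
    translate = [random.randint(-2, 2), random.randint(-2, 2)] \
        if random.random() < 0.5 else [0, 0]                                   # translation (px)
    return TF.affine(img, angle=angle, translate=translate, scale=scale, shear=shear)

x1 = torch.rand(1, 105, 105)          # (C, H, W) tensor image
x1_distorted = random_affine(x1)      # a separate T is drawn for each image in a pair
```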


References

Koch, Gregory R. "Siamese Neural Networks for One-Shot Image Recognition." (2015).

Source: blog.csdn.net/qq_41958946/article/details/128848966