Reading notes: "Improved Deep Hashing with Soft Pairwise Similarity for Multi-label Image Retrieval"

introduction

Existing hashing methods treat an image pair as similar as long as at least one label matches. In the example shown in the figure, both pairs (a, b) and (a, c) are therefore counted as similar,

[Figure: example of multi-label image pairs that conventional pairwise similarity cannot tell apart]

which makes it impossible to rank multi-label pairs by how similar they actually are (e.g., to decide which pair should be top-1). The paper therefore uses the semantic labels held by each image to propose a soft definition of pairwise similarity: specifically, pairwise similarity is quantified as a percentage using the normalized semantic labels. (contribution 1)

Two kinds of similarity are therefore defined. The hard similarity covers the case where the labels fully match (or do not match at all), so cross-entropy learning is used; the soft similarity covers partial label matching, so mean squared error is used. (contribution 2)

related work

A simple approach to deep hash learning is to directly threshold high-level features, typified by DLBHC [46], which learns hash-like binary representations through a latent hash layer. When the network is well fine-tuned on the classification task, the features of the latent hash layer are considered discriminative, and they indeed perform better than hand-crafted features. (In other words, deep hashing beats feature engineering, but how does it compare to using deep features directly, say raw AlexNet features, or to traditional hashing?)

For multi-label retrieval, DSRH [25] tries to utilize ranking information of multi-level similarities to learn a hash function, and proposes a surrogate loss to solve the optimization problem of ranking measures. IAH [47] focuses on learning instance-aware image representations and uses a weighted triplet loss to maintain similarity rankings for multi-label images. However, the weighted triplet losses employed by DSRH [25] and IAH [47] do not enforce direct constraints on learning fine-grained multi-level semantic similarity, since they focus on maintaining the correct ranking of images according to their similarity to the query. (In other words, these methods keep tinkering with the loss function instead of tackling the real pain point of multi-label similarity.)

Based on this, DMSSPH [48] tries to construct hash functions that maximize the discriminability of the output space in order to preserve the multi-level similarity between multi-label images. Although DMSSPH [48] has exploited fine-grained multi-level semantic similarity for pairwise similarity learning, there is still room for further exploration. A novel and effective TALR approach is proposed in [36], which considers tied rankings on integer-valued Hamming distances and directly optimizes the ranking-based evaluation metrics Mean Average Precision (MAP) [49] and Normalized Discounted Cumulative Gain (NDCG) [50]; it achieves high performance on several benchmark datasets. In [51], two new protocols for evaluating supervised hashing methods are proposed in the context of transfer learning.

In this paper, we explore the diversity of pairwise semantic similarity on multi-label datasets to improve hash quality. Specifically, fine-grained pairwise similarity values are defined in continuous form (i.e., replacing the discrete similarity levels with continuous values?). The pairwise similarity is then divided into two cases, and a joint pairwise loss function is constructed to perform feature learning and hash code generation simultaneously.

method

To capture multi-label similarity, the pairwise label similarity is quantified as a continuous percentage value, namely the cosine similarity of the two images' semantic label vectors (the paper claims to be the first to use cosine similarity to quantify the fine-grained semantic similarity of image pairs).
[Equation image: pairwise similarity defined as the cosine similarity of the two semantic label vectors]

The image is passed through AlexNet, and the output of the final fc8 layer is mapped into (-1, 1) by the following activation function. (The paper mentions that AlexNet can be freely replaced with VGG, GoogLeNet, etc.; so is there any discussion of why these newer networks would not beat the original one?)

[Equation image: activation that maps the fc8 output into (-1, 1)]

hard similarity

[Equation image: pairwise cross-entropy loss for the hard-similarity case]
where Ω is the inner product of the two hash codes

[Equation image]

soft similarity

[Equation image: mean squared error loss for the soft-similarity case]

joint learning

To learn both cases simultaneously in a unified form, Mij is used to mark the two cases, where Mij = 1 indicates the "hard similarity" case and Mij = 0 indicates the "soft similarity" case. The pairwise similarity loss is therefore rewritten as (this is essentially just the two loss functions combined):
[Equation image: joint pairwise loss selecting the hard or soft term via Mij]

Directly optimizing this equation is challenging: the binary constraint bi ∈ {−1, 1}^q requires thresholding the network output, which leads to a vanishing-gradient problem in backpropagation during training.

Scaled pairwise quantization loss
[Equation image: scaled pairwise quantization loss Q]
The final loss is C = L + λQ

experiment

Performance

The evaluation metrics are Average Cumulative Gain (ACG) [60], Normalized Discounted Cumulative Gain (NDCG) [50], Mean Average Precision (MAP) [49] and Weighted Average Precision (WAP) [25].

Turning to the hashing loss function in the code: it maps high-dimensional features to compact binary codes to improve retrieval speed and efficiency.
Its inputs are:

  • D: the feature matrix of the samples (the hash-layer outputs), with shape (batch_size, feature_dim).
  • label: the label matrix of the samples, with shape (batch_size, num_class), where num_class is the number of categories.
  • alpha, belta and gama: the weight coefficients of the three loss terms.
  • m: the length of the hash code.

Specifically, this hashing loss function consists of three parts: a pairwise similarity loss, a scaling by the hash code length, and a regularization (quantization) term. The similarity loss matches the hash codes by comparing pairwise code similarities against the label similarities.
First, compute the cosine similarity matrix of the labels:

# L2 norm of each label vector
label_count = tf.expand_dims(tf.sqrt(tf.reduce_sum(tf.square(label), 1)), 1)
# unit-normalize the label vectors
norm_label = label / tf.tile(label_count, [1, args.num_class])
# cosine similarity matrix between the label vectors
w_label = tf.matmul(norm_label, norm_label, False, True)
# zero out similarities above the 0.99 threshold (the fully-similar "hard" pairs)
semi_label = tf.where(w_label > 0.99, w_label - w_label, w_label)

Then compute the pairwise similarity of the samples as the inner products of their codes (note that D is not normalized here, so this is an inner-product matrix rather than a true cosine matrix):

p2_distance = tf.matmul(D, D, False, True)

The scaling by the hash code length m (together with the weight belta) keeps the magnitude of the pairwise term independent of the code length. The loss then branches on whether a pair is "hard" (fully similar or dissimilar) or "soft" (partially similar):

scale_distance = belta * p2_distance / m   # scaled inner-product matrix
temp = tf.log(1 + tf.exp(scale_distance))
loss = tf.where(semi_label < 0.01,
                temp - w_label * scale_distance,
                gama * m * tf.square((p2_distance + m) / 2 / m - w_label))
regularizer = tf.reduce_mean(tf.abs(tf.abs(D) - 1))
d_loss = tf.reduce_mean(loss) + alpha * regularizer


In this way the code realizes the paper's objective C = L + λQ (alpha in the code plays the role of λ): L is the pairwise similarity loss, and the regularization term Q pulls each continuous output component toward ±1 so that the codes can be binarized with little information loss.

The output of this function consists of two values:

  • d_loss: the total hashing loss value.
  • w_label: the cosine similarity matrix between the labels, with shape (batch_size, batch_size).

In the function implementation, the label vectors are first normalized and their cosine similarity matrix is computed, with similarities above the threshold set to 0. Next, the inner-product matrix between sample codes is computed and, for the soft branch, mapped into the range 0 to 1. Finally, the loss terms are computed and their weighted sum is returned as the total hashing loss.

main

The role of this code is to read the data from the tfrecord file, build the AlexNet model, compute the hashing loss (d_loss), and train with the optimizer.

Specifically, the data is first read from the tfrecord file via the reader.read_and_decode function (this needs major changes for TF2, where it should be replaced with the tf.data API), yielding a set of images (img) and their corresponding labels (label). Then tf.train.shuffle_batch shuffles the read images and labels into batches of size args.batch_size, which are used to train the model.

Next, the AlexNet function builds the AlexNet model, takes the image data in the batch as input, and produces an output D containing the hash-like code of each image (the code dimension is controlled by num_bits).

Then the hashing_loss function computes the hashing loss, taking the network output D and the label batch label_batch as inputs. args.alpha, args.belta and args.gama are hyperparameters; judging from the loss code, alpha weights the quantization regularizer, belta scales the pairwise inner products, and gama weights the soft-similarity (MSE) term.

Finally, the computed hashing loss (d_loss) and the model output (out) are returned.

Then optimize the training process:

  1. According to the specified skip_layers, all trainable variables are divided into two categories: var_list1 and var_list2. Among them, var_list1 includes all variables that need fine-tuning, and var_list2 includes all variables that need to be trained from scratch.
  2. Define learning_rate, and set exponential decay.
  3. Define two Adam optimizers: opt1 and opt2. Among them, the learning rate of opt1 is learning_rate*0.01, which is used to optimize the variables in var_list1; the learning rate of opt2 is learning_rate, which is used to optimize the variables in var_list2.
  4. Computes the gradient grads for all variables and splits the grads into two parts: grads1 and grads2 according to var_list1 and var_list2.
  5. Use opt1 and opt2 to optimize the gradients in grads1 and grads2 respectively, and update global_step.
  6. Combine the update operations of the two optimizers into a single train_op (see the sketch after this list).

This implements the whole training procedure: the gradients of the two groups of variables are updated by different optimizers, realizing the two training modes, fine-tuning and training from scratch, within one graph.

In the training loop, the nodes in the TensorFlow computation graph (Graph) are executed through a Session object.

First, a Saver object is defined to save the trained model. Next, sess.run initializes the global and local variables and loads the pretrained weights into the network. Then the dataset queue threads are started (this only works with TF1 queue runners; it does not carry over to TF2) and the training loop begins.

In the training loop, three nodes are run through sess.run: train_op (the training op), d_loss (the loss node) and global_step (the global step counter). train_op applies the computed gradients to the variables and returns None; d_loss computes the loss and returns a scalar; global_step is a variable whose value increases by one each time the training op is executed.

During training, the condition step1 % 10 == 0 is used to print training information every 10 iterations, including the current iteration number (step1), the loss value (loss_t) and the elapsed time (elapsed_time). The condition step1 % args.save_freq == 0 controls saving the model every args.save_freq iterations. When all samples in the dataset have been traversed, training ends; finally the queue threads are stopped and the session exits.

Regarding AlexNet: the convolution is implemented Caffe-style. When groups equals 1, the convolution is performed directly; when groups is greater than 1, the input and the convolution kernels are split into that many groups, each group is convolved separately, and the results are concatenated. The output is then processed with a bias add, ReLU activation, etc. (groups is the number of groups for grouped convolution; the default is 1)


Origin blog.csdn.net/weixin_40459958/article/details/130647603