introduction
Traditional hashing treats two images as similar as long as at least one label matches, so in the example shown in the figure, both the pair ab and the pair ac count as matches. This makes it impossible for top-1 retrieval to rank multi-label pairs by similarity, so the paper uses the semantic labels held by each image to propose a soft definition of pairwise similarity. Specifically, pairwise similarities are quantified as percentages using normalized semantic labels (contribution 1).
Two kinds of similarity are therefore proposed. The hard similarity covers pairs whose labels all match and is learned with cross-entropy; the soft similarity covers pairs whose labels match only partially and is learned with mean squared error (contribution 2).
related work
A simple approach to deep hash learning is to directly threshold high-level features, typified by DLBHC [46], which learns hash-like representations. Once the network is well fine-tuned on the classification task, the features of the latent hash layer are considered discriminative, and they indeed perform better than hand-crafted features. (In other words, deep hashing beats feature engineering; but how does it compare with using deep features directly, plain AlexNet, or traditional hashing?)
For multi-label retrieval, DSRH [25] tries to use the ranking information of multi-level similarities to learn a hash function, and proposes a surrogate loss to solve the optimization problem of ranking measures. IAH [47] focuses on learning instance-aware image representations and uses a weighted triplet loss to maintain similarity rankings for multi-label images. However, the weighted triplet losses used by DSRH [25] and IAH [47] place no direct constraint on learning fine-grained multi-level semantic similarity, since they only aim to keep images correctly ranked by their similarity to the query. (In other words, they keep fiddling with the loss function instead of addressing the real pain point, which is multi-label similarity itself.)
Based on this, DMSSPH [48] tries to construct a hash function that maximizes the discriminability of the output space in order to preserve the multi-level similarity between multi-label images. Although DMSSPH [48] exploits fine-grained multi-level semantic similarity for pairwise similarity learning, there is still room for further exploration. A novel and effective approach, TALR, is proposed in [36]; it considers tied rankings over integer-valued Hamming distances and directly optimizes the ranking-based evaluation metrics mean Average Precision (MAP) [49] and normalized Discounted Cumulative Gain (NDCG) [50]. It achieves high performance on several benchmark datasets. In [51], two new protocols for evaluating supervised hashing methods are proposed in the context of transfer learning.
In this paper, we explore the diversity of pairwise semantic similarity on multi-label datasets to improve hash quality. Specifically, fine-grained pairwise similarity values are defined in a continuous form (turning the discrete Hamming distance into a continuous value?). The pairwise similarity is then divided into two cases, and a joint pairwise loss function is constructed to perform feature learning and hash code generation simultaneously.
method
To measure multi-label similarity, the pairwise label similarity is quantified as a continuous percentage value, namely the cosine similarity of the semantic label vectors of the two images (according to the paper, it is the first to use cosine similarity to quantify the fine-grained semantic similarity of image pairs).
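A minimal sketch of this quantification in plain Python follows. The function name, the four-class label vectors and the handling of an all-zero label vector are assumptions made for illustration; they do not come from the paper.

```python
import math

def label_cosine_similarity(a, b):
    """Cosine similarity between two multi-hot label vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an image without labels is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical label vectors over four classes.
a = [1, 1, 0, 0]
b = [1, 1, 0, 0]
c = [1, 0, 1, 0]
d = [0, 0, 0, 1]

sim_ab = label_cosine_similarity(a, b)  # identical labels, close to 1
sim_ac = label_cosine_similarity(a, c)  # one of two labels shared, close to 0.5
sim_ad = label_cosine_similarity(a, d)  # nothing shared, exactly 0
```

With a binary "shares at least one label" rule, the pairs ab and ac would both count as matches; the cosine value separates them.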
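The image passes through AlexNet, and the output of the final fc8 layer is mapped to (-1, 1) through the following activation function. (The paper mentions that the AlexNet used here can be freely replaced with VGG, GoogLeNet, etc.; so is there any explanation of why those two newer networks do no better than the original one?)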
hard similarity
where Ω is the inner product of the two hash codes
soft similarity
joint learning
To learn both cases simultaneously in a unified form, Mij is used to mark the two cases, where Mij = 1 indicates the "hard similarity" case and Mij = 0 indicates the "soft similarity" case. The pairwise similarity loss is therefore rewritten as: (isn't this really just the two loss functions combined?)
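As a sketch of what such a unified loss can look like, the form below is reconstructed from the TensorFlow loss code later in these notes rather than copied from the paper. The symbol s_ij (the cosine label similarity of images i and j) is notation introduced here, q is the code length, and the weight γ, the constant factors and the exact scaling of Ω are assumptions read off that code.

```latex
\[
L \;=\; \sum_{i,j} \Big[\, M_{ij}\,\big(\log\!\big(1 + e^{\Omega_{ij}}\big) - s_{ij}\,\Omega_{ij}\big)
\;+\; \gamma\,(1 - M_{ij})\,\Big(\frac{\langle b_i, b_j\rangle + q}{2q} - s_{ij}\Big)^{2} \,\Big]
\]
```

The first term is the cross-entropy (logistic) loss for hard pairs; the second maps the inner product from [-q, q] to [0, 1] and penalizes its squared distance from the label similarity for soft pairs.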
Directly optimizing this equation is challenging: the binary constraint b_i ∈ {−1, 1}^q requires thresholding the network output, which can lead to a vanishing gradient problem in backpropagation during training.
Scaled pairwise quantization loss
The final loss is C = L + λQ
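To make the role of Q concrete, here is a small plain-Python sketch of the quantization penalty as it appears in the code later in these notes (the mean of | |u| − 1 | over all outputs). The function name and the sample values are hypothetical, and the paper's exact scaling of Q may differ.

```python
def quantization_penalty(outputs):
    """Mean of | |u| - 1 | over every entry of a batch of relaxed codes."""
    entries = [value for row in outputs for value in row]
    return sum(abs(abs(value) - 1.0) for value in entries) / len(entries)

# Hypothetical network outputs in (-1, 1) for two images with 2-bit codes.
near_binary = [[0.99, -0.98], [-1.0, 0.97]]
far_from_binary = [[0.1, -0.2], [0.0, 0.3]]

q_near = quantization_penalty(near_binary)     # small: outputs are almost +/-1
q_far = quantization_penalty(far_from_binary)  # large: outputs sit near 0
```

The penalty is small when thresholding the outputs to ±1 would change them very little, which is the situation the relaxed optimization is meant to reach.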
experiment
Performance
Average cumulative gain (ACG) [60], normalized discounted cumulative gain (NDCG) [50], mean Average Precision (MAP) [49] and Weighted Average Precision (WAP) [25].
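For orientation, a minimal sketch of two of these metrics in their commonly used form. It assumes the relevance of a returned image is the number of labels it shares with the query, and that the ideal ordering for NDCG is taken over the same returned list; the paper's exact protocol may differ, and the function names are illustrative.

```python
import math

def acg(relevances, n):
    """Average cumulative gain over the top-n results."""
    top = relevances[:n]
    return sum(top) / len(top)

def dcg(relevances, n):
    """Discounted cumulative gain with the usual (2^r - 1) / log2(1 + rank) form."""
    return sum((2 ** r - 1) / math.log2(1 + rank)
               for rank, r in enumerate(relevances[:n], start=1))

def ndcg(relevances, n):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), n)
    return dcg(relevances, n) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: shared-label counts of the top 4 returned images.
ranked = [2, 0, 1, 2]
acg_at_4 = acg(ranked, 4)
ndcg_at_4 = ndcg(ranked, 4)
```

Unlike a binary hit/miss measure, both metrics reward returning images that share more labels with the query, which is why they suit multi-label retrieval.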
Regarding the hash loss function: it trains the network that maps high-dimensional data to low-dimensional binary codes, which improves the speed and efficiency of data retrieval.
The inputs are:
- D: the feature matrix of the samples (the network output), with shape (batch_size, feature_dim).
- label: the label matrix of the samples, with shape (batch_size, num_class), where num_class is the number of categories.
- alpha, belta and gama: the weight coefficients of the loss terms.
- m: the length of the hash code.
Specifically, this hash loss function consists of three parts: a cross-entropy term for hard-similarity pairs, a mean-squared-error term for soft-similarity pairs, and a quantization regularization term. The two pairwise terms make the similarity between the codes of a sample pair agree with the cosine similarity of their labels.
First, calculate the cosine similarity matrix of the labels:
# L2 norm of each label vector
label_count = tf.expand_dims(tf.sqrt(tf.reduce_sum(tf.square(label), 1)),1)
# Unit-length label vectors
norm_label = label/tf.tile(label_count,[1,args.num_class])
# Cosine similarity matrix between the label vectors
w_label = tf.matmul(norm_label, norm_label, False, True)
# Same matrix, with similarities above the 0.99 threshold set to 0
semi_label = tf.where(w_label>0.99, tf.zeros_like(w_label),w_label)
Then calculate the pairwise inner products of the network outputs:
p2_distance = tf.matmul(D, D, False, True)
The pairwise loss is computed from this inner-product matrix. For hard pairs (label similarity above 0.99 or below 0.01), the inner product is scaled by belta / m and fed into a logistic (cross-entropy) loss. For soft pairs, the inner product is mapped from [-m, m] to [0, 1] and compared with the label similarity by squared error. The regularizer pushes every entry of D towards ±1.
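# Inner-product matrix scaled by belta / m
scale_distance = belta * p2_distance / m
# log(1 + exp(x)), the first half of the logistic loss for hard pairs
temp = tf.log(1+tf.exp(scale_distance))
# Hard pairs (semi_label < 0.01): logistic loss; soft pairs: weighted squared error
loss = tf.where(semi_label<0.01,temp - w_label * scale_distance, gama*m*tf.square((p2_distance+m)/2/m-w_label))
# Quantization regularizer: push every entry of D towards -1 or +1
regularizer = tf.reduce_mean(tf.abs(tf.abs(D) - 1))
# Total loss: mean pairwise loss plus the weighted regularizer
d_loss = tf.reduce_mean(loss) + alpha * regularizer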
This corresponds to the final loss C = L + λQ in the paper, with alpha playing the role of λ: tf.reduce_mean(loss) is the pairwise loss L, and regularizer is the quantization loss Q, which keeps the network outputs close to ±1 so that little information is lost when they are binarized.
The output of this function consists of two values:
- d_loss: the total hash loss value.
- w_label: the cosine similarity matrix between labels, with shape (batch_size, batch_size).
In the function implementation, the label vectors are first L2-normalized, then the cosine similarity matrix between the labels is calculated, and similarities above the threshold are set to 0 so that hard pairs can be told apart from soft pairs. Next, the inner-product matrix between the network outputs is calculated; for soft pairs it is mapped to a value range between 0 and 1. Finally, the loss terms are calculated and their weighted sum is taken as the total hash loss value.
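The per-pair logic can be checked by hand with a framework-free sketch. The function below mirrors the TensorFlow code for a single pair of outputs; the name pair_loss, the default values of belta and gama, and the 4-bit example codes are assumptions for illustration only.

```python
import math

def pair_loss(u, v, s, m, belta=1.0, gama=1.0):
    """Loss for one pair of relaxed codes u, v with label similarity s."""
    inner = sum(x * y for x, y in zip(u, v))  # one entry of D * D^T
    scaled = belta * inner / m                # one entry of scale_distance
    if s > 0.99 or s < 0.01:
        # Hard pair (fully similar or fully dissimilar): logistic loss.
        return math.log(1.0 + math.exp(scaled)) - s * scaled
    # Soft pair: map the inner product from [-m, m] to [0, 1], then squared error.
    mapped = (inner + m) / (2.0 * m)
    return gama * m * (mapped - s) ** 2

m = 4
u = [1.0, 1.0, 1.0, 1.0]
loss_same = pair_loss(u, [1.0, 1.0, 1.0, 1.0], s=1.0, m=m)        # identical labels, identical codes
loss_opposite = pair_loss(u, [-1.0, -1.0, -1.0, -1.0], s=1.0, m=m)  # identical labels, opposite codes
loss_soft = pair_loss(u, [1.0, 1.0, -1.0, -1.0], s=0.5, m=m)      # half-similar labels, half the bits agree
```

Here loss_soft is zero because codes that agree on half of their bits map to 0.5, which equals the label similarity, while loss_opposite is larger than loss_same because identical labels paired with opposite codes are penalized.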
main
The role of this code is to read the data in the tfrecord file, build the AlexNet model, calculate the hash loss (d_loss), and use the optimizer for training.
Specifically, the data is first read from the tfrecord file through the reader.read_and_decode function (this needs a major change in TF2, where it has to be replaced with the tf.data functions), giving a set of images (img) and their corresponding labels (label). The tf.train.shuffle_batch function then shuffles the images and labels into batches of size args.batch_size, which are used to train the model.
Next, the AlexNet function builds the AlexNet model, takes the image data in the batch as input and produces an output D. This output contains the relaxed hash code of each image (the dimension of the hash code is controlled by num_bits).
Then the hashing_loss function calculates the hash loss, taking the output D and the labels label_batch as input parameters. args.alpha, args.belta and args.gama are hyperparameters: alpha weights the quantization regularizer, belta scales the inner product in the hard-similarity term, and gama weights the soft-similarity term.
Finally, the computed hash loss (d_loss) and the model output (out) are returned.
Then optimize the training process:
- According to the specified skip_layers, all trainable variables are divided into two categories: var_list1 and var_list2. Among them, var_list1 includes all variables that need fine-tuning, and var_list2 includes all variables that need to be trained from scratch.
- Define learning_rate, and set exponential decay.
- Define two Adam optimizers: opt1 and opt2. Among them, the learning rate of opt1 is learning_rate*0.01, which is used to optimize the variables in var_list1; the learning rate of opt2 is learning_rate, which is used to optimize the variables in var_list2.
- Compute the gradients grads for all variables and split them into two parts, grads1 and grads2, according to var_list1 and var_list2.
- Use opt1 and opt2 to apply the gradients in grads1 and grads2 respectively, and update global_step.
- Combine the update operations of the two optimizers into one train_op.
In this way the training process of the entire model is implemented: the two groups of variables are updated by different optimizers, which realizes fine-tuning and training from scratch side by side.
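The variable split and the two learning rates can be sketched without TensorFlow. The variable names, the layer/parameter naming scheme, and all numeric values below are hypothetical; the decay follows the usual exponential form base_lr * decay_rate ^ (step / decay_steps), which is assumed to be what the training script uses.

```python
def exponential_decay(base_lr, step, decay_steps, decay_rate, staircase=False):
    """Exponentially decayed learning rate at a given global step."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent

def split_by_layer(var_names, skip_layers):
    """Split variables named 'layer/param' into fine-tuned and from-scratch groups."""
    fine_tune = [n for n in var_names if n.split('/')[0] not in skip_layers]
    from_scratch = [n for n in var_names if n.split('/')[0] in skip_layers]
    return fine_tune, from_scratch

# Hypothetical variable names; only fc8 is trained from scratch.
var_names = ['conv1/weights', 'conv1/biases', 'fc7/weights',
             'fc8/weights', 'fc8/biases']
var_list1, var_list2 = split_by_layer(var_names, skip_layers=['fc8'])

learning_rate = exponential_decay(base_lr=1e-3, step=1000,
                                  decay_steps=500, decay_rate=0.5)
lr_fine_tune = learning_rate * 0.01  # used by opt1 on var_list1
lr_from_scratch = learning_rate      # used by opt2 on var_list2
```

Giving the pretrained layers a learning rate 100 times smaller keeps their weights close to the pretrained values while the new hash layer learns quickly.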
In the training loop, the nodes of the TensorFlow computation graph (Graph) are executed through the session (Session) object.
First, a Saver object is defined to save the trained model. Next, sess.run is used to initialize the global and local variables, and the pretrained model weights are loaded into the network. Then the dataset queue threads are started (this presumably only works in TF1; in TF2 it gets stuck) and the training loop is entered.
In the training loop, sess.run runs three nodes: train_op (training node), d_loss (loss node) and global_step (global step node). train_op is the operation that applies the calculated gradients to the variables, and its return value is None; d_loss is the node that calculates the loss, and its return value is a scalar; global_step is a variable whose value is increased by one every time the training node is executed.
During training, step1 % 10 == 0 controls printing the training information every 10 iterations, including the current iteration number (step1), the loss value (loss_t) and the elapsed time (elapsed_time). step1 % args.save_freq == 0 controls saving the model every args.save_freq iterations.
When all samples in the data set have been traversed, the training ends. Finally, the queue threads are stopped and the session is closed.
Regarding AlexNet, the convolution is implemented in a way similar to Caffe. When groups equals 1, the convolution is performed directly; when groups is greater than 1, the input and the convolution kernels are split into that many groups, each group is convolved separately, and the results are concatenated. The output then goes through bias addition, ReLU activation, and so on. (groups: the number of groups for grouped convolution; the default is 1.)
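The split, convolve and concatenate pattern can be illustrated with a deliberately simplified sketch: a 1x1 convolution applied to the channel vector of a single pixel, in plain Python. Real AlexNet layers use spatial kernels, but the grouping works the same way per group; all names and numbers here are hypothetical.

```python
def conv1x1(x, weights):
    """1x1 convolution at one pixel: each weight row produces one output channel."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def grouped_conv1x1(x, weights, groups=1):
    """Split channels into groups, convolve each group separately, concatenate."""
    if groups == 1:
        return conv1x1(x, weights)
    in_per_group = len(x) // groups
    out_per_group = len(weights) // groups
    out = []
    for g in range(groups):
        x_g = x[g * in_per_group:(g + 1) * in_per_group]
        w_g = weights[g * out_per_group:(g + 1) * out_per_group]
        out.extend(conv1x1(x_g, w_g))
    return out

# Four input channels, four output channels, two groups.
# Each kernel only sees the two input channels of its own group.
x = [1, 2, 3, 4]
weights = [[1, 0], [0, 1],   # group 0 kernels
           [1, 1], [2, 0]]   # group 1 kernels
grouped = grouped_conv1x1(x, weights, groups=2)

# Bias addition followed by ReLU, as described above.
bias = [0, -3, 0, -10]
activated = [max(0, o + b) for o, b in zip(grouped, bias)]
```

With two groups, every kernel needs only half as many weights as in the ungrouped case, since it never sees the channels of the other group.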