Recently I have been studying fine-tuning in TensorFlow, and there is far less material on the topic for TF than for Caffe. After a few days of working through it, I mostly understand it now, thanks to the few open-source write-ups available online.
There are two main methods:
1. Start from the graph
The inspiration for this scheme comes from: https://www.cnblogs.com/sikyadjy/p/6861692.html
Here I re-summarize the original author's three steps:
1) First, print the trainable variables of each layer
# Assumes TensorFlow is imported as tf and sess is an active tf.Session
variables_names = [v.name for v in tf.trainable_variables()]
values = sess.run(variables_names)
for k, v in zip(variables_names, values):
    print("Variable:", k)
    print("Shape:", v.shape)
    print(v)
2) Set different learning rates for the front and rear layers.
For example, use a lower learning rate to fine-tune the first 20 layers (20 layers have about 40 trainable variables: 20 weights and 20 biases)
var1 = tf.trainable_variables()[0:40]
var2 = tf.trainable_variables()[40:]
train_op1 = tf.train.GradientDescentOptimizer(0.00001).minimize(loss, var_list=var1)
train_op2 = tf.train.GradientDescentOptimizer(0.0001).minimize(loss, var_list=var2)
train_op = tf.group(train_op1, train_op2)
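To make the effect of the two grouped train ops concrete, here is a framework-free sketch in plain Python. It assumes plain gradient descent on a toy loss (the sum of squared parameters); the names `sgd_step`, `params`, and `split` are hypothetical and not part of TensorFlow. It only illustrates that the front slice moves with the smaller step and the rest with the larger one.

```python
# Toy illustration (not TensorFlow): two parameter groups updated with
# different learning rates, mirroring var1/var2 above.

def sgd_step(params, grads, split, lr_front, lr_rear):
    """Apply one gradient-descent step with a per-group learning rate.

    Parameters before index `split` use lr_front, the rest use lr_rear.
    """
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        lr = lr_front if i < split else lr_rear
        updated.append(p - lr * g)
    return updated


# Hypothetical toy loss: sum of squares, so d(loss)/d(p) = 2 * p.
params = [1.0, 1.0, 1.0, 1.0]
grads = [2.0 * p for p in params]

new_params = sgd_step(params, grads, split=2, lr_front=0.00001, lr_rear=0.0001)
front_change = params[0] - new_params[0]
rear_change = params[2] - new_params[2]
print(front_change, rear_change)
```

With the learning rates used above, the rear group changes ten times as much per step as the front group for the same gradient.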
3) Load the pre-trained model for fine-tuning
Note:
In my experiments, following the original author's method always produced an error when I loaded the pre-trained ckpt.
The likely cause lies in the graph. The fine-tuning graph defines two training ops, train_op1 and train_op2, whereas my pre-trained model has only one train_op, so the two graphs may not match, which leads to the error when loading the weights.
The fix was to define two train_ops during pre-training as well, with both learning rates set to the same value. After retraining the model, the checkpoint loads correctly.
Summary: this method has an obvious limitation: the graph of the pre-trained model has to be adapted to match the fine-tuning graph.
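One way to check whether a graph mismatch is behind such a loading error is to compare the variable names in the fine-tuning graph with the names stored in the checkpoint. The sketch below uses plain Python strings. Every name in it is a hypothetical example (including the assumption that a second optimizer adds bookkeeping variables under new names), and `missing_from_checkpoint` is an illustrative helper, not a TensorFlow function.

```python
# Toy illustration: find graph variables that a checkpoint cannot supply.
# All names below are hypothetical examples, not read from a real model.

def missing_from_checkpoint(graph_var_names, ckpt_var_names):
    """Return graph variable names that are absent from the checkpoint."""
    ckpt = set(ckpt_var_names)
    return sorted(name for name in graph_var_names if name not in ckpt)


# Names saved by a pre-training graph with a single optimizer (assumed).
ckpt_names = ["conv1/weights", "conv1/biases", "beta1_power", "beta2_power"]

# Names in a fine-tuning graph with two optimizers (assumed): the second
# optimizer is taken to add its own bookkeeping variables.
graph_names = ckpt_names + ["beta1_power_1", "beta2_power_1"]

missing = missing_from_checkpoint(graph_names, ckpt_names)
print(missing)
```

In a real TensorFlow setup, the checkpoint side of this comparison could come from `tf.train.list_variables` and the graph side from `tf.global_variables()`. Any name reported as missing would need to be excluded from the restore or initialized separately.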
2. Start from the gradients (backward pass)
This inspiration comes from: http://blog.csdn.net/liyuan123zhouhui/article/details/69569493
Here is the code from the original author:
# TensorFlow: restore only specified layers from a ckpt file, or skip restoring specified layers
# TensorFlow: assign different learning rates to different layers
with tf.Graph().as_default():
    # Variables of the layers that should be restored from the checkpoint
    variables_to_restore = []
    # Variables of the layers to train. Here these are the ones that are not restored and must be trained from scratch; restored variables can in fact be trained as well
    variables_to_train = []
    for var in slim.get_model_variables():
        excluded = False
        for exclusion in fine_tune_layers:
            # For example, fine_tune_layers contains 'logits' and 'bottleneck'
            if var.op.name.startswith(exclusion):
                excluded = True
                break
        if not excluded:
            variables_to_restore.append(var)
            # print('var to restore :', var)
        else:
            variables_to_train.append(var)
            # print('var to train: ', var)
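    # Some steps are omitted here; on to the training step:
    # Pass variables_to_train, the variables to be trained, to the optimizer's compute_gradients function
    grads = opt.compute_gradients(total_loss, variables_to_train)
    # This call computes gradients only for the variables in variables_to_train
    # Then apply the gradients:
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
    # Alternatively, call opt.minimize(total_loss, global_step=global_step, var_list=variables_to_train) directly
    # minimize simply wraps compute_gradients and apply_gradients in one function; it still calls both internally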
    # If different variables need different learning rates, the gradients can be rescaled:
    capped_grads_and_vars = []  # [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
    # update_gradient_vars holds the variables to update with the global learning rate
    # For every other variable, the gradient is multiplied by 0.0001 so that it barely moves
    for grad, var in grads:
        # Test membership once per variable so each gradient is appended exactly once
        if any(var is v for v in update_gradient_vars):
            capped_grads_and_vars.append((grad, var))
        else:
            capped_grads_and_vars.append((0.0001 * grad, var))
    apply_gradient_op = opt.apply_gradients(capped_grads_and_vars, global_step=global_step)
    # When restoring the model:
    with sess.as_default():
        if pretrained_model:
            print('Restoring pretrained model: %s' % pretrained_model)
            init_fn = slim.assign_from_checkpoint_fn(
                pretrained_model,
                variables_to_restore)
            init_fn(sess)
    # This way, the parameters of the specified layers are not restored
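The restore/train split in the code above can be tried out without TensorFlow. The sketch below uses plain strings in place of slim variables; `split_variables` and all variable names are hypothetical, chosen only to show how the scope-prefix test sorts variables into the two lists.

```python
# Toy illustration of the restore/train split, using plain strings in
# place of slim variables. All names here are hypothetical.

def split_variables(var_names, fine_tune_layers):
    """Split names into (to_restore, to_train) by scope prefix."""
    to_restore, to_train = [], []
    for name in var_names:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if name.startswith(tuple(fine_tune_layers)):
            to_train.append(name)
        else:
            to_restore.append(name)
    return to_restore, to_train


var_names = [
    "conv1/weights",
    "conv1/biases",
    "bottleneck/weights",
    "logits/weights",
    "logits/biases",
]
to_restore, to_train = split_variables(var_names, ["logits", "bottleneck"])
print(to_restore)
print(to_train)
```

Because the match is a plain prefix test, a scope such as `logits_aux` would also be caught by the prefix `logits`. Using a prefix like `logits/` is one way to avoid that if it is not intended.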
This approach does not need two train_ops. Instead, it splits the single minimize call into its two parts: Optimizer.compute_gradients() and Optimizer.apply_gradients().
The operation is as follows:
1. First compute the gradients: Optimizer.compute_gradients()
2. For the parameters that should only be fine-tuned, multiply their gradients by a tiny number to shrink the update, and store the results in a list, capped_grads_and_vars.
3. Pass the scaled gradients back into the graph with Optimizer.apply_gradients().
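The three steps can be sketched in plain Python to see what the scaling does to the update. This is a toy under stated assumptions: plain gradient descent, a sum-of-squares loss, and hypothetical names (`scale_gradients`, `apply_gradients`, and the variable names), none of which are TensorFlow API.

```python
# Toy illustration (not TensorFlow): scale selected gradients before a
# plain gradient-descent update. All names are hypothetical.

def scale_gradients(grads_and_vars, full_rate_vars, scale=0.0001):
    """Keep gradients of full_rate_vars, multiply all others by scale."""
    scaled = []
    for grad, var in grads_and_vars:
        factor = 1.0 if var in full_rate_vars else scale
        scaled.append((factor * grad, var))
    return scaled


def apply_gradients(values, grads_and_vars, learning_rate):
    """Return new values after one gradient-descent step."""
    new_values = dict(values)
    for grad, var in grads_and_vars:
        new_values[var] = values[var] - learning_rate * grad
    return new_values


values = {"conv1/weights": 1.0, "logits/weights": 1.0}
# Step 1: "compute" gradients (toy loss: sum of squares, gradient 2 * value).
grads_and_vars = [(2.0 * v, name) for name, v in values.items()]
# Step 2: shrink the gradients of the layers that should barely move.
scaled = scale_gradients(grads_and_vars, full_rate_vars={"logits/weights"})
# Step 3: apply the scaled gradients with one global learning rate.
new_values = apply_gradients(values, scaled, learning_rate=0.1)
print(new_values)
```

With plain gradient descent, multiplying a gradient by 0.0001 has the same effect as giving that variable a learning rate 0.0001 times the global one. With adaptive optimizers such as Adam, which normalize by gradient magnitude, that equivalence generally does not hold, so the result of scaling should be checked for the optimizer in use.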
Comparing the two methods, the second is clearly more convenient, although which one to use depends on the situation.