A common technique when tuning a model is to assign different learning rates to different layers, which helps avoid overfitting and other problems caused by layers that differ in how difficult they are to train.
1. Model example
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model, self).__init__()
        self.linear_in = nn.Linear(input_size, hidden_size)
        # DynamicLSTM is a custom RNN wrapper (defined elsewhere) that handles
        # variable-length sequences; here it runs a bidirectional GRU.
        self.global_gru = DynamicLSTM(hidden_size, hidden_size, bidirectional=True, rnn_type='GRU')
        self.linear_out = nn.Linear(2 * hidden_size, output_size)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x, length):
        x = self.linear_in(x)
        x = self.global_gru(x, length)
        x = F.relu(self.linear_out(x))
        x = self.dropout(x)
        return x
Suppose we now want a larger learning rate for the GRU layer and a smaller one for the remaining layers of the model. How do we set this up?
2. Setting method
from torch.optim import Adam

model = Model(1024, 400, 512)

# Collect the ids of the GRU layer's parameters
gru_layer_params = list(map(id, model.global_gru.parameters()))

# Filter out the parameters of all remaining layers
rest_layers_params = filter(lambda p: id(p) not in gru_layer_params, model.parameters())

# Use the Adam optimizer, with weight_decay uniformly set to 0.00001
optimizer = Adam([{"params": model.global_gru.parameters(), "lr": 0.0002},
                  {"params": rest_layers_params, "lr": 0.00005}],
                 weight_decay=0.00001)
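To see the grouping work end to end, here is a minimal runnable sketch using a plain nn.GRU as a stand-in for the custom DynamicLSTM (which is not shown in this post); TinyModel and all sizes are illustrative assumptions:

```python
import torch.nn as nn
from torch.optim import Adam

class TinyModel(nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.linear_in = nn.Linear(8, 4)
        self.global_gru = nn.GRU(4, 4, bidirectional=True, batch_first=True)
        self.linear_out = nn.Linear(8, 2)

model = TinyModel()

# Collect the ids of the GRU parameters, then filter out the rest.
gru_param_ids = set(map(id, model.global_gru.parameters()))
rest_params = [p for p in model.parameters() if id(p) not in gru_param_ids]

# Two parameter groups: the GRU gets a larger learning rate.
optimizer = Adam([{"params": model.global_gru.parameters(), "lr": 2e-4},
                  {"params": rest_params, "lr": 5e-5}],
                 weight_decay=1e-5)

# Each group carries its own learning rate (weight_decay is shared).
for group in optimizer.param_groups:
    print(group["lr"], len(group["params"]))
```

Inspecting optimizer.param_groups like this is a quick way to confirm that every parameter landed in the group, and hence the learning rate, you intended.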
This technique is also commonly used in multi-task learning: assigning different learning rates to different tasks can help the model achieve better results.
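As a sketch of the multi-task case, a hypothetical model with a shared encoder and two task heads could give each head its own learning rate in the same way (the modules, sizes, and rates below are illustrative assumptions, not from the original post):

```python
import torch.nn as nn
from torch.optim import Adam

# Hypothetical multi-task setup: one shared encoder, two task-specific heads.
shared_encoder = nn.Linear(16, 8)
head_a = nn.Linear(8, 3)  # e.g. a classification head for task A
head_b = nn.Linear(8, 1)  # e.g. a regression head for task B

# The head of the harder task (here assumed to be task A) gets a larger
# learning rate; the shared encoder and the other head use a smaller one.
optimizer = Adam([{"params": shared_encoder.parameters(), "lr": 1e-4},
                  {"params": head_a.parameters(), "lr": 3e-4},
                  {"params": head_b.parameters(), "lr": 1e-4}])
```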