How to set different learning rates for different layers of the model?

A common technique when tuning a model is to assign different learning rates to different layers, which helps avoid overfitting and other problems caused by layers that differ in how hard they are to train.

1. Example model

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model, self).__init__()
        self.linear_in = nn.Linear(input_size, hidden_size)
        # DynamicLSTM is a custom RNN wrapper, configured here as a bidirectional GRU
        self.global_gru = DynamicLSTM(hidden_size, hidden_size, bidirectional=True, rnn_type='GRU')
        self.linear_out = nn.Linear(2 * hidden_size, output_size)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x, length):
        x = self.linear_in(x)
        x = self.global_gru(x, length)
        x = F.relu(self.linear_out(x))
        x = self.dropout(x)
        return x

Suppose now we need to set a larger learning rate for the GRU layer and a smaller learning rate for the remaining layers of the model. What should we do?

2. How to set it

from torch.optim import Adam

model = Model(1024, 400, 512)

# Collect the ids of the GRU layer's parameters
gru_layer = torch.nn.ModuleList([model.global_gru])
gru_layer_params = list(map(id, gru_layer.parameters()))

# Parameters of all remaining layers (everything whose id is not in the GRU list)
rest_layers_params = filter(lambda p: id(p) not in gru_layer_params, model.parameters())

# Use the Adam optimizer; weight_decay is set to 0.00001 for all parameter groups
optimizer = Adam([{"params": model.global_gru.parameters(), "lr": 0.0002},
                  {"params": rest_layers_params, "lr": 0.00005}],
                 weight_decay=0.00001)
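
To confirm that the two groups were registered with the intended rates, you can inspect optimizer.param_groups. The following is a minimal check, assuming the optimizer was built as above:

# Each entry in optimizer.param_groups is a dict holding that group's
# hyperparameters ("lr", "weight_decay") and its parameter list.
for i, group in enumerate(optimizer.param_groups):
    n_params = sum(p.numel() for p in group["params"])
    print(f"group {i}: lr={group['lr']}, weight_decay={group['weight_decay']}, params={n_params}")
# Expected: group 0 (GRU) uses lr=0.0002, group 1 (remaining layers) uses lr=0.00005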

The same technique is also useful in multi-task learning: giving the layers of each task their own learning rate can help the model achieve better results, as sketched below.
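
For example, with a shared encoder and two task-specific heads (head_a and head_b are hypothetical names for illustration, not modules from the original model), the parameter groups might look like this:

# Hypothetical multi-task setup: a shared encoder plus two task-specific heads.
shared_encoder = nn.GRU(400, 400, bidirectional=True, batch_first=True)
head_a = nn.Linear(800, 10)   # head for task A
head_b = nn.Linear(800, 5)    # head for task B

optimizer = Adam([
    {"params": shared_encoder.parameters(), "lr": 0.00005},  # shared layers: smaller lr
    {"params": head_a.parameters(), "lr": 0.0002},           # task A head: larger lr
    {"params": head_b.parameters(), "lr": 0.0001},           # task B head: its own lr
], weight_decay=0.00001)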

Original post: blog.csdn.net/weixin_45684362/article/details/132251692