[Li Hongyi 2022 Machine Learning Spring] hw7_BERT (notes on pitfalls)

Grading

(figure omitted)

Experimental record

Medium baseline

Hyperparameters: max_question_len = 40, max_paragraph_len = 350, doc_stride = 300
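
For reference, doc_stride controls how far the paragraph window is shifted when a long paragraph is split into overlapping windows. A rough sketch of the idea (the function and variable names below are my own illustration, not the course code):

max_paragraph_len = 350
doc_stride = 300

def split_into_windows(paragraph_token_ids):
    # Slide a max_paragraph_len-token window over the tokenized paragraph,
    # advancing by doc_stride tokens each time, so consecutive windows
    # overlap by max_paragraph_len - doc_stride = 50 tokens here.
    return [paragraph_token_ids[start:start + max_paragraph_len]
            for start in range(0, len(paragraph_token_ids), doc_stride)]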

##### TODO: Apply linear learning rate decay #####
# Inverse-time style decay; write the value back so the optimizer actually uses it.
learning_rate = learning_rate * (1.0 / (1.0 + 0.00001 * step))
optimizer.param_groups[0]["lr"] = learning_rate

The resulting learning rate curve (figure omitted).

Training techniques

fp16_training

(figure omitted)

Official example (figure omitted).
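
A minimal sketch of how fp16 training with the Hugging Face accelerate library fits into the training loop. This is a sketch under assumptions: recent accelerate versions take mixed_precision="fp16" (older versions used Accelerator(fp16=True)), bert-base-chinese is used as the pretrained model, and train_loader is assumed to come from the notebook's DataLoader setup:

import torch
from accelerate import Accelerator
from transformers import BertForQuestionAnswering

fp16_training = True

# The Accelerator handles autocasting and gradient scaling for fp16.
accelerator = Accelerator(mixed_precision="fp16" if fp16_training else "no")

model = BertForQuestionAnswering.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# accelerator.prepare moves model, optimizer and dataloader to the right
# device and precision; train_loader is assumed to be defined as in the notebook.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    output = model(**batch)            # returns a loss when start/end positions are given
    accelerator.backward(output.loss)  # replaces output.loss.backward()
    optimizer.step()
    optimizer.zero_grad()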

Gradient accumulation

From: https://kozodoi.me/python/deep%20learning/pytorch/tutorial/2021/02/19/gradient-accumulation.html

# batch accumulation parameter
accum_iter = 4  

# loop through enumerated batches
for batch_idx, (inputs, labels) in enumerate(data_loader):

    # extract inputs and labels
    inputs = inputs.to(device)
    labels = labels.to(device)

    # passes and weights update
    with torch.set_grad_enabled(True):
        
        # forward pass 
        preds = model(inputs)
        loss  = criterion(preds, labels)

        # normalize loss to account for batch accumulation
        loss = loss / accum_iter 

        # backward pass
        loss.backward()

        # weights update
        if ((batch_idx + 1) % accum_iter == 0) or (batch_idx + 1 == len(data_loader)):
            optimizer.step()
            optimizer.zero_grad()

Takeaways:

Train faster: fp16_training
Train with a larger effective batch: gradient accumulation (gradients from accum_iter mini-batches are summed before each optimizer.step(), so the effective batch size is accum_iter times the mini-batch size)

Linear learning rate decay:

from transformers import get_linear_schedule_with_warmup  # provided by https://huggingface.co/transformers/, not by PyTorch itself
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,              # default value; with 0 warmup steps the schedule is a pure linear decay
                                            num_training_steps=total_steps)
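
For completeness, a rough sketch of where the scheduler call goes in the training loop when combined with gradient accumulation (num_epoch, accum_iter, model, train_loader and accelerator are assumed from the snippets above; the scheduler is stepped once per optimizer update):

# one optimizer update happens every accum_iter mini-batches
total_steps = num_epoch * len(train_loader) // accum_iter
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

for epoch in range(num_epoch):
    for batch_idx, batch in enumerate(train_loader):
        loss = model(**batch).loss / accum_iter   # normalize for accumulation
        accelerator.backward(loss)
        if (batch_idx + 1) % accum_iter == 0 or (batch_idx + 1) == len(train_loader):
            optimizer.step()
            scheduler.step()        # decay the learning rate linearly
            optimizer.zero_grad()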

See also the Hugging Face library: https://huggingface.co/


Original post: https://blog.csdn.net/weixin_43154149/article/details/124417296