PyTorch Model Training

Brief introduction

In the previous two articles of this column, I covered data preparation and model construction. The next step is to train and optimize the constructed model, so that the trained model can be put to practical use.

Loss function

A loss function measures the error between the predicted value and the target value; the model is optimized by minimizing it. Different loss functions measure error differently, and the best choice depends on the task and the design, not just on accuracy. PyTorch packages many common loss functions in the torch.nn module, and they are all used in a similar way: instantiate a loss object, then call the instance with the predicted value and the target value to compute the loss.

  • L1 loss
    • nn.L1Loss(reduction='mean')
    • Computes the L1 loss, i.e. the mean absolute error.
    • The (deprecated) reduce parameter indicates whether to return a scalar (the default) or a tensor with the same dimensions as the input.
    • The (deprecated) size_average parameter indicates whether the returned scalar is the mean (the default) or the sum of the element-wise losses.
    • The reduction parameter replaces the two parameters above; its values 'mean', 'sum' and 'none' correspond to the results just described.
    • The following code demonstrates the loss calculation process.
      import torch
      from torch import nn

      # 100 predictions of 0.5 against targets of 1, so the per-element error is 0.5.
      pred = torch.ones(100, 1) * 0.5
      label = torch.ones(100, 1)

      l1_mean = nn.L1Loss()                    # default reduction='mean'
      l1_sum = nn.L1Loss(reduction='sum')

      print(l1_mean(pred, label))              # tensor(0.5000)
      print(l1_sum(pred, label))               # tensor(50.)
      
  • MSE loss
    • nn.MSELoss(reduction='mean')
    • Computes the mean squared error, commonly used for regression.
    • Parameters are the same as above.
  • CE loss
    • nn.CrossEntropyLoss(weight=None, ignore_index=-100, reduction='mean')
    • Computes the cross-entropy loss, commonly used for classification. It is not the bare cross-entropy: the raw outputs are first turned into a probability distribution with softmax (internally it combines LogSoftmax and NLLLoss) before the cross-entropy is computed.
    • The weight parameter assigns a weight to each class, which helps with class imbalance.
    • The reduction parameter behaves as in the losses above.
    • The ignore_index parameter specifies a class index to ignore; samples of that class do not contribute to the loss.
  • KL divergence
    • nn.KLDivLoss(reduction='mean')
    • Computes the Kullback-Leibler divergence.
    • Parameters are the same as above.
  • Binary cross-entropy
    • nn.BCELoss(reduction='mean')
    • Computes the binary cross-entropy loss, generally used for binary classification; the input is expected to already be a probability (e.g. after a sigmoid).
  • Binary cross-entropy with logits
    • nn.BCEWithLogitsLoss()
    • Applies a sigmoid to the input and then computes the binary cross-entropy, analogous to how CrossEntropyLoss applies softmax to its input (see the sketch after this list).
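
As a quick illustration (not from the original article), the sketch below contrasts how CrossEntropyLoss and BCEWithLogitsLoss are called: both take raw, unnormalized scores (logits), but the former expects integer class indices while the latter expects float targets in {0, 1}.

import torch
from torch import nn

# Multi-class case: CrossEntropyLoss applies softmax internally,
# so it takes raw logits plus integer class indices.
logits = torch.randn(4, 3)              # batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])
print(nn.CrossEntropyLoss()(logits, targets))

# Binary case: BCEWithLogitsLoss applies sigmoid internally,
# so it also takes raw logits, with float targets.
bin_logits = torch.randn(4, 1)
bin_targets = torch.tensor([[1.], [0.], [1.], [0.]])
print(nn.BCEWithLogitsLoss()(bin_logits, bin_targets))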

These are just a few of the most commonly used loss functions; the official documentation covers the rest, nearly 20 in total. You can of course also define a custom loss function, as long as it returns a scalar or a tensor, which is essentially all that the built-in ones do.
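
For instance, a custom loss can be written as an ordinary nn.Module whose forward returns a scalar tensor. The class below is a made-up example (a weighted L1 loss), not something provided by torch.nn.

import torch
from torch import nn


class WeightedL1Loss(nn.Module):
    # Hypothetical custom loss: mean absolute error scaled by a fixed weight.

    def __init__(self, weight=2.0):
        super(WeightedL1Loss, self).__init__()
        self.weight = weight

    def forward(self, pred, target):
        # Returning a scalar tensor keeps it usable with loss.backward().
        return self.weight * torch.abs(pred - target).mean()


criterion = WeightedL1Loss()
print(criterion(torch.ones(10, 1) * 0.5, torch.ones(10, 1)))   # tensor(1.)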

Optimizer

Once the data, the model, and the loss function are settled, more than half of the deep learning task is done; the next step is to choose a suitable optimizer to train and optimize the model.

First, a look at how PyTorch optimizers work: all of them inherit from the Optimizer class, which encapsulates basic methods such as state_dict() and load_state_dict().
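
As a small sketch of those two methods, an optimizer's state can be saved and restored like any other state dict (the file name here is arbitrary):

import torch
from torch import optim

w = torch.randn(2, 2, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

# Save the optimizer state, e.g. together with the model when checkpointing.
torch.save(optimizer.state_dict(), 'optimizer_state.pth')

# Later, restore it into a freshly constructed optimizer to resume training.
new_optimizer = optim.SGD([w], lr=0.1, momentum=0.9)
new_optimizer.load_state_dict(torch.load('optimizer_state.pth'))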

Parameter groups (param_groups)

Every optimizer has a param_groups attribute, because an optimizer manages its parameters in groups, and each group can be configured with its own learning rate, momentum, weight decay and so on. The attribute is a list of dictionaries, one per group, each holding that group's parameters and configuration.

The following example has only one group; a second sketch with two groups follows the output below.

import torch
import torch.optim as optim


w1 = torch.randn(2, 2)
w2 = torch.randn(2, 2)

optimizer = optim.SGD([w1, w2], lr=0.1)
print(optimizer.param_groups)

(Screenshot: the printed param_groups, a list with a single dictionary holding the params along with lr, momentum and the other options.)
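
As a follow-up sketch (not in the original article), the same optimizer can be built with two groups, each carrying its own configuration; options left out of a group fall back to the defaults passed to the constructor.

import torch
import torch.optim as optim

w1 = torch.randn(2, 2)
w2 = torch.randn(2, 2)

# Two groups: w1 uses lr=0.1, w2 uses lr=0.01 with its own momentum.
optimizer = optim.SGD([
    {'params': [w1], 'lr': 0.1},
    {'params': [w2], 'lr': 0.01, 'momentum': 0.9},
], lr=0.05, momentum=0.8)

print(len(optimizer.param_groups))   # 2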

Clearing gradients

Note that PyTorch does not clear gradients automatically after an optimization step; they keep accumulating until they are cleared manually. So every optimization iteration needs a call to the optimizer's zero_grad() method.
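
A minimal sketch of this behaviour: gradients keep accumulating across backward() calls until zero_grad() clears them.

import torch
from torch import optim

w = torch.ones(1, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

(w * 2).sum().backward()
print(w.grad)            # tensor([2.])
(w * 2).sum().backward()
print(w.grad)            # tensor([4.]), the second gradient was added to the first

optimizer.zero_grad()    # clear before the next iteration
print(w.grad)            # zeros (or None in versions that set grads to None)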

Adding parameter groups

Calling the optimizer's add_param_group() method adds a new, custom-configured group of parameters.
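
For example (w3 is just an extra tensor invented for illustration):

import torch
import torch.optim as optim

w1 = torch.randn(2, 2)
w3 = torch.randn(3, 3)

optimizer = optim.SGD([w1], lr=0.1)
print(len(optimizer.param_groups))   # 1

# Register a new group with its own learning rate.
optimizer.add_param_group({'params': w3, 'lr': 0.01})
print(len(optimizer.param_groups))   # 2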

Common optimizers

PyTorch encapsulates these optimization algorithms in the torch.optim module; some implementations have been modified relative to the original papers, so see the source code for details.

  • Stochastic gradient descent
    • optim.SGD(params, lr, momentum, weight_decay)
    • Stochastic gradient descent optimizer.
    • The params parameter specifies the set of parameters to be managed.
    • The lr parameter is the initial learning rate, which can be adjusted during training as needed.
    • The momentum parameter sets the momentum of SGD, typically 0.9.
    • The weight_decay parameter is the weight decay coefficient, i.e. the L2 regularization coefficient.
  • Adam
    • optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False)
    • Implementation of the Adam optimization algorithm.
    • Parameters are similar to the above (see the construction sketch after this list).
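
A minimal construction sketch for both optimizers (the Linear layer is only a stand-in model for illustration):

import torch
from torch import nn, optim

net = nn.Linear(10, 2)   # stand-in model

# SGD with momentum and L2 regularization via weight_decay.
sgd = optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam with its default betas and a small initial learning rate.
adam = optim.Adam(net.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=1e-4)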

The following illustration compares the convergence of various optimization algorithms on the same problem.

(Figure: convergence comparison of different optimization algorithms.)

Learning rate adjustment strategies

An appropriate learning rate lets the model converge quickly, which was the original motivation behind algorithms such as Adam. We usually start training with a relatively large learning rate and gradually decrease it as training progresses. When to decrease it, and by how much, is the question of learning rate scheduling. PyTorch provides six ready-to-use strategies in torch.optim.lr_scheduler, which fall into ordered adjustment (more rigid), adaptive adjustment (more flexible) and custom adjustment (tailored case by case).
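
As an example of the ordered kind (a sketch, with a stand-in model), StepLR multiplies the learning rate by gamma every step_size epochs:

import torch
from torch import nn, optim

net = nn.Linear(10, 2)   # stand-in model
optimizer = optim.SGD(net.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... forward, loss, backward, optimizer.step() would go here ...
    scheduler.step()

print(optimizer.param_groups[0]['lr'])   # about 1e-4 after three decays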

The most commonly used adaptive mechanism is introduced here. It is packaged as optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-8).

It adjusts the learning rate when a monitored metric stops improving, which makes it a very practical strategy. For example, when the validation loss stops decreasing the model may be about to overfit, so the learning rate is reduced (a usage sketch follows the parameter list below).

  • The mode parameter takes one of 'min' and 'max': adjust when the metric stops decreasing, or stops increasing, respectively.
  • The factor parameter is the ratio by which the learning rate is multiplied when it is adjusted.
  • The patience parameter is the number of steps to wait: if the metric has not improved for that many steps, the learning rate is adjusted.
  • The verbose parameter controls whether a message is printed whenever the learning rate is adjusted.
  • The cooldown parameter is a cooldown period: after an adjustment, no further adjustments are made until it has elapsed.
  • The min_lr parameter is the lower bound on the learning rate.
  • The eps parameter is the minimum decay of the learning rate: if the change would be smaller than eps, no adjustment is made.
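
A minimal usage sketch, assuming the monitored metric is a validation loss computed each epoch (here replaced by a random placeholder so the snippet runs on its own):

import torch
from torch import nn, optim

net = nn.Linear(10, 2)   # stand-in model
optimizer = optim.SGD(net.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                 factor=0.1, patience=10)

for epoch in range(100):
    # ... train for one epoch, then evaluate on the validation set ...
    val_loss = torch.rand(1).item()   # placeholder for the real validation loss
    # Unlike most schedulers, step() takes the monitored metric as an argument.
    scheduler.step(val_loss)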

Training process in practice

The following code demonstrates the whole process of preparing data, building a model, choosing a loss function, and optimizing with an optimizer; most PyTorch model training follows this pattern.

import torch
from torch import nn
import torch.nn.functional as F
from torch import optim


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # Input images are 3×224×224 (see the dummy batch below).
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3))   # -> 32×222×222
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)                           # -> 32×111×111
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)       # -> 64×109×109
        self.pool2 = nn.MaxPool2d(2, 2)                                              # -> 64×54×54
        self.fc1 = nn.Linear(64*54*54, 256)   # flattened feature map -> 256
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 101)        # 101 output classes

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = x.view(-1, 64*54*54)   # flatten the feature map for the fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
# Dummy data: a batch of 32 random 3×224×224 images, all labelled as class 1.
x = torch.randn((32, 3, 224, 224))
y = torch.ones(32, ).long()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9, dampening=0.1)
# Ordered schedule: multiply the learning rate by 0.1 every 50 steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

epochs = 10
losses = []
for epoch in range(epochs):

    correct = 0.0
    total = 0.0

    optimizer.zero_grad()                        # clear accumulated gradients
    outputs = net(x)                             # forward pass
    loss = criterion(outputs, y)                 # compute the loss
    loss.backward()                              # backward pass
    optimizer.step()                             # update the parameters
    scheduler.step()                             # advance the learning rate schedule
    _, predicted = torch.max(outputs.data, 1)    # predicted class per sample
    total += y.size(0)
    correct += (predicted == y).squeeze().sum().numpy()
    losses.append(loss.item())
    print("loss", loss.item(), "acc", correct / total)

import matplotlib.pyplot as plt
plt.plot(list(range(len(losses))), losses)
plt.savefig('his.png')
plt.show()

The figure below shows how the loss changes during training; since the demo uses only a single fixed batch, training converges quickly and reaches 100% accuracy.

(Figure: the training loss curve, also saved as his.png.)

Supplement

This article described loss functions, optimizers and the optimization process in PyTorch, which is the final and one of the most important steps in training a deep model. All the code in this article is open-sourced on my GitHub; stars and forks are welcome.
