A Simplified Quick-Start Tutorial for Deep Learning with PyTorch

A typical training procedure for a neural network is as follows:

Define a neural network that has some learnable parameters (or weights)
Iterate over a dataset of inputs
Process each input through the network
Compute the loss (how far the output is from being correct)
Propagate gradients back into the network's parameters
Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient
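As a minimal sketch of the update rule above, the loop below applies it to a single weight. The quadratic loss, the starting value, and the learning rate are illustrative assumptions, not part of the original tutorial.

```python
# Minimal sketch of: weight = weight - learning_rate * gradient
# The loss (weight - 3)^2 and every name here are illustrative assumptions.

def loss_fn(weight):
    # The loss is smallest when weight == 3.0
    return (weight - 3.0) ** 2


def gradient_fn(weight):
    # Analytic derivative of loss_fn with respect to weight
    return 2.0 * (weight - 3.0)


weight = 0.0
learning_rate = 0.1
for step in range(100):
    gradient = gradient_fn(weight)
    weight = weight - learning_rate * gradient

print(weight, loss_fn(weight))
```

Each step shrinks the distance to the minimum by a constant factor, so the weight settles very close to 3.0.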

An example network definition follows:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)
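The `16 * 6 * 6` input size of `fc1` in the listing above can be traced with a little arithmetic. The sketch below assumes a 32x32 input image, which the text does not state explicitly, and uses a hypothetical helper named `conv_out`.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output height/width of a convolution or pooling window
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                   # assumed input height and width
size = conv_out(size, kernel=3)             # conv1 (3x3): 32 -> 30
size = conv_out(size, kernel=2, stride=2)   # max pool (2x2): 30 -> 15
size = conv_out(size, kernel=3)             # conv2 (3x3): 15 -> 13
size = conv_out(size, kernel=2, stride=2)   # max pool (2x2): 13 -> 6

flat_features = 16 * size * size            # 16 channels leave conv2
print(size, flat_features)
```

Under that assumption the spatial size after the second pooling step is 6, which matches the `6 * 6` in the `fc1` definition.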

Generally, when you have to deal with image, text, audio, or video data, you can use standard Python packages that load the data into a numpy array. You can then convert this array into a torch.*Tensor.

For images, packages such as Pillow and OpenCV are useful
For audio, packages such as scipy and librosa are useful
For text, either raw Python- or Cython-based loading, or NLTK and SpaCy, are useful
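A minimal sketch of the array-to-Tensor conversion described above. A random array stands in for an image that Pillow or OpenCV would have loaded, and the channels-first layout and the scaling to [0, 1] are common conventions assumed here rather than steps the text prescribes.

```python
import numpy as np
import torch

# Stand-in for a loaded image: height x width x channels, 8-bit values
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# torch.from_numpy shares memory with the numpy array (no copy is made)
tensor = torch.from_numpy(image)

# Reorder to channels-first (C, H, W) and scale to floats in [0, 1]
tensor = tensor.permute(2, 0, 1).float() / 255.0

print(tensor.shape, tensor.dtype)
```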

Specifically for vision, there is a package called torchvision, which provides data loaders for common datasets such as ImageNet, CIFAR10, and MNIST, as well as data transformers for images, namely torchvision.datasets and torch.utils.data.DataLoader.

This is a great convenience and avoids writing boilerplate code.

Training an image classifier

We will perform the following steps in order:

Load and normalize the CIFAR10 training and test datasets using torchvision
Define a convolutional neural network
Define a loss function
Train the network on the training data
Test the network on the test data
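The five steps above can be sketched end to end as follows. To keep the example runnable without a download, random tensors shaped like CIFAR10 images stand in for the real dataset. The tiny model, the batch size, the learning rate, and the use of torch.optim.SGD are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Step 1: data. Random stand-ins shaped like CIFAR10 (3x32x32 images, 10 classes).
train_set = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
test_set = TensorDataset(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,)))
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16)

# Step 2: a deliberately tiny convolutional network
model = nn.Sequential(
    nn.Conv2d(3, 6, 5),          # 32x32 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),             # 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(6 * 14 * 14, 10),
)

# Step 3: loss function (plus an optimizer that applies the weight updates)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Step 4: train on the training data
for epoch in range(2):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Step 5: test on the test data
correct = 0
total = 0
model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        predicted = model(images).argmax(dim=1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
accuracy = correct / total
print(f"accuracy on random data: {accuracy:.2f}")
```

Because the labels here are random, the accuracy should hover around chance; with the real CIFAR10 data the same loop structure applies.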

Implementing an example network with numpy

# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
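One way to gain confidence in hand-written backward formulas like those in the listing above is to compare them with finite differences. The sketch below repeats the same gradient formulas on a much smaller problem. The sizes, the seed, and the helper names `loss_of` and `numeric_grad` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2          # small sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_of(w1, w2):
    # Same forward pass and loss as the two-layer network above
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


# Analytic gradients, using the same formulas as the listing above
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)


def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, perturbing one entry of w at a time
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


err_w1 = np.abs(numeric_grad(lambda: loss_of(w1, w2), w1) - grad_w1).max()
err_w2 = np.abs(numeric_grad(lambda: loss_of(w1, w2), w2) - grad_w2).max()
print(err_w1, err_w2)
```

Both errors should be tiny compared with the gradient values themselves. The check can misfire if a hidden activation happens to sit almost exactly at zero, where ReLU is not differentiable.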

Numpy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or more, so unfortunately numpy alone is not enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they are also useful as a generic tool for scientific computing. Unlike numpy, PyTorch Tensors can use GPUs to accelerate their numerical computations. To run a PyTorch Tensor on a GPU, you only need to place it on the appropriate device. Here we use PyTorch Tensors to fit a two-layer network to random data. As in the numpy example above, we need to manually implement the forward and backward passes through the network:

# -*- coding: utf-8 -*-

import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
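The listing above switches devices by commenting a line in or out. A common alternative, sketched here as an assumption rather than as part of the original tutorial, is to pick the device at run time and fall back to the CPU when no GPU is visible.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.randn(64, 1000, device=device)   # created directly on the device
b = torch.randn(1000, 100).to(device)      # created on the CPU, then moved

c = a.mm(b)                                # runs on the device holding a and b
print(c.device, c.shape)

# A Tensor has to be on the CPU before it can be converted to a numpy array
c_numpy = c.cpu().numpy()
```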

PyTorch: Tensors and autograd

In the example above, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it can quickly become very cumbersome for large, complex networks.

Fortunately, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients. This sounds complicated, but it is quite simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor with x.requires_grad=True, then x.grad is another Tensor holding the gradient of some scalar value with respect to x. Here we use PyTorch Tensors and autograd to implement our two-layer network. Now we no longer need to manually implement the backward pass through the network:

# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional (scalar) Tensor, with shape ()
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
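The listing above zeroes the gradients after every update. The small sketch below, using an illustrative three-element Tensor that is not part of the original tutorial, shows why: calling backward() a second time adds to x.grad instead of replacing it.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The forward pass builds the graph; y is a scalar
y = (x ** 2).sum()
y.backward()
first = x.grad.clone()          # dy/dx = 2x

# A second backward pass accumulates into x.grad rather than overwriting it
y = (x ** 2).sum()
y.backward()
accumulated = x.grad.clone()    # now twice the true gradient

# Resetting in place restores a clean starting point for the next pass
x.grad.zero_()
print(first, accumulated, x.grad)
```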

PyTorch: Defining new autograd functions

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing its forward and backward functions. We can then use the new autograd operator by calling its apply method like a function and passing it a Tensor containing the input data. In this example we define our own custom autograd function for performing the ReLU nonlinearity and use it to implement our two-layer network:

# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
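A custom backward function is easy to get wrong, so it can help to test it numerically. The sketch below repeats the MyReLU class from the listing above so that it runs on its own, then uses torch.autograd.gradcheck, which compares the analytic gradient with finite differences. The sample values are an assumption, chosen to stay away from zero, where ReLU is not differentiable.

```python
import torch


class MyReLU(torch.autograd.Function):
    # Same custom ReLU as in the listing above

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck needs double precision for its finite differences to be reliable
sample = torch.tensor([-2.0, -0.5, 0.7, 3.0],
                      dtype=torch.double, requires_grad=True)
check_passed = torch.autograd.gradcheck(MyReLU.apply, (sample,),
                                        eps=1e-6, atol=1e-4)

# The forward result should also agree with the built-in ReLU
same_as_builtin = torch.equal(MyReLU.apply(sample), torch.relu(sample))
print(check_passed, same_as_builtin)
```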

PyTorch: nn

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level. When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that will be optimized during learning. In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks. In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks. In this example we use the nn package to implement our two-layer network:

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
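An earlier code comment mentions that torch.optim.SGD can perform the same weight update. As a hedged sketch, the code below runs the manual update from the listing above next to plain SGD on an identical copy of a small model and measures how far the two drift apart. The model size, the data, and the learning rate are illustrative assumptions.

```python
import copy
import torch

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 2)
learning_rate = 1e-2

model_manual = torch.nn.Linear(4, 2)
model_optim = copy.deepcopy(model_manual)    # identical starting weights
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model_optim.parameters(), lr=learning_rate)

for t in range(5):
    # Manual update, as in the listing above
    model_manual.zero_grad()
    loss_fn(model_manual(x), y).backward()
    with torch.no_grad():
        for param in model_manual.parameters():
            param -= learning_rate * param.grad

    # The same step expressed through the optimizer
    optimizer.zero_grad()
    loss_fn(model_optim(x), y).backward()
    optimizer.step()

# Largest difference between corresponding parameters of the two models
max_diff = max(
    (p - q).abs().max().item()
    for p, q in zip(model_manual.parameters(), model_optim.parameters())
)
print(max_diff)
```

The difference should be zero or at the level of floating-point rounding, since SGD without momentum or weight decay applies the same rule as the manual loop.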

To be continued.