How to use PyTorch with the M1 chip's GPU for deep learning model training

Acceleration principle

Apple has its own GPU programming API, Metal, and PyTorch's acceleration on Apple Silicon is built on top of it. Specifically, PyTorch uses Apple's Metal Performance Shaders (MPS) as a backend to enable GPU-accelerated training. The MPS backend extends the PyTorch framework with scripts and functions for setting up and running operations on the Mac. MPS optimizes compute performance with kernels fine-tuned for the unique characteristics of each Metal GPU family. The new device maps machine learning computational graphs and primitives onto the MPS Graph framework and the tuned kernels provided by MPS.

Accordingly, the newly added device is named mps, and it is used much like cuda, for example:

import torch
foo = torch.rand(1, 3, 224, 224).to('mps')

device = torch.device('mps')
foo = foo.to(device)
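
In practice it is safer to check at runtime whether MPS can actually be used and fall back to the CPU otherwise; a minimal sketch:

import torch

# Use mps when the backend is usable on this machine, otherwise fall back to cpu
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
foo = torch.rand(1, 3, 224, 224, device=device)
print(foo.device)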

In addition, I found that Pytorch already supports the following devices, which is really unexpected:

cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, 
ideep, hip, ve, ort, mps, xla, lazy, vulkan, meta, hpu

Environment configuration

In order to use this experimental feature, you need to meet the following three conditions:

  1. A Mac with an Apple Silicon chip (M1, M1 Pro, M1 Max, M1 Ultra)

  2. An arm64 (Apple Silicon native) build of Python installed

  3. The latest nightly build of PyTorch installed

Assuming the machine is ready, we can download the arm64 version of Miniconda from the Miniconda download page (the installer is named "Miniconda3 macOS Apple M1 64-bit bash"); the Python environment it installs is arm64. The commands to download and install Miniconda are as follows:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh 

chmod +x Miniconda3-latest-MacOSX-arm64.sh 

./Miniconda3-latest-MacOSX-arm64.sh 

Just follow the prompts. After the installation is complete, create a virtual environment and verify the Python architecture by confirming that platform.uname()[4] is arm64:

conda config --env --set always_yes true
conda create -n try-mps python=3.8
conda activate try-mps
python -c "import platform; print(platform.uname()[4])"

If the last command prints arm64, the Python build is correct and you can move on.

The third step is to install the nightly build of PyTorch; run the following command in the activated virtual environment:

python -m pip  install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

After the installation finishes, check whether PyTorch was built with the MPS backend using the following command:

python -c "import torch;print(torch.backends.mps.is_built())"

If the output is True, the MPS backend is built in and you can go on.
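
Note that is_built() only reports whether the installed PyTorch binary was compiled with MPS support; torch.backends.mps.is_available() additionally checks that the current machine and OS version can actually use it. A short sketch covering both checks:

import torch

# is_built(): the binary was compiled with MPS support
# is_available(): the MPS device can actually be used on this machine right now
print(torch.backends.mps.is_built())
print(torch.backends.mps.is_available())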

Run an MNIST example

The code below is based on the MNIST example from PyTorch's official examples, modified to compare the cpu and mps devices:

from __future__ import print_function
import argparse
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--epochs', type=int, default=1, metavar='N',
                        help='number of epochs to train (default: 1)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--use_gpu', action='store_true', default=False,
                        help='enable MPS')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_gpu = args.use_gpu

    torch.manual_seed(args.seed)

    device = torch.device("mps" if args.use_gpu else "cpu")

    train_kwargs = {'batch_size': args.batch_size}
    if use_gpu:
        cuda_kwargs = {'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    dataset1 = datasets.MNIST('../data', train=True, download=True,
                       transform=transform)
    dataset2 = datasets.MNIST('../data', train=False,
                       transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1,**train_kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        scheduler.step()


if __name__ == '__main__':
    t0 = time.time()
    main()
    t1 = time.time()
    print('time_cost:', t1 - t0)
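
Note that the script builds dataset2 (the MNIST test split) but never uses it. If you also want test accuracy, an evaluation function in the style of the official example could be added and called after each epoch; a sketch (the test_loader it expects would have to be created from dataset2, which the script above does not do):

def test(model, device, test_loader):
    # Evaluate classification accuracy on the test split
    model.eval()
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    print('Test accuracy: {:.2f}%'.format(100. * correct / len(test_loader.dataset)))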

Test CPU:

python main.py

Test MPS:

python main.py --use_gpu

Testing on my M1 machine, training one epoch of MNIST took 149.6 s on the CPU and 18.4 s with MPS, roughly an 8x speedup. The improvement is significant; or, put another way, the CPU is simply too slow. In short, if you can train a model with mps, do so.

Origin blog.csdn.net/weixin_45277161/article/details/130849661