Distributed machine learning (Parameter Server)

In distributed machine learning, a parameter server is used to manage and share model parameters. The basic idea is to store the model parameters on one or more central servers and share them over the network with every computing node participating in training. Each computing node can fetch the current model parameters from the parameter server and send its computation results back to the parameter server for updating.

To maintain model consistency, one of the following two methods is generally adopted:

  1. Store the model parameters on a centralized node. When a computing node needs to perform model training, it obtains the parameters from the centralized node, trains the model, and then pushes the updated model back to the centralized node. Since all computing nodes pull parameters from the same centralized node, model consistency is guaranteed.
  2. Each computing node holds a copy of the model parameters and trains its local copy on its own partition of the training data, so the copies must be force-synchronized periodically. After each training iteration, the copies stored on different computing nodes may differ because they were trained on different input data. Therefore, a global synchronization step is inserted after each training iteration to average the parameters across the computing nodes, ensuring model consistency in a fully distributed manner; this is the All-Reduce paradigm (sketched below).
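
A minimal sketch of the second approach, simulating the workers as local model replicas instead of real processes (a real implementation would typically synchronize with torch.distributed.all_reduce; the helper average_parameters below is illustrative, not from the original post):

import torch
import torch.nn as nn

def average_parameters(replicas):
    """In-place average of the parameters of all model replicas (the All-Reduce step)."""
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in replicas)):
            mean = torch.mean(torch.stack([p.data for p in params]), dim=0)
            for p in params:
                p.data.copy_(mean)

# Hypothetical usage: three replicas, each of which would first run one
# training iteration on its own data partition, then be synchronized.
replicas = [nn.Linear(4, 2) for _ in range(3)]
average_parameters(replicas)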

PS architecture

In this architecture, there are two roles: parameter server and worker

The parameter server plays the role of the master node in a Master/Worker architecture, while the workers act as the computing nodes responsible for model training

The workflow of the whole system is divided into 4 stages:

  1. Pull Weights : All workers get weight parameters from the parameter server
  2. Push Gradients : Each worker uses local training data to train a local model, generates local gradients, and then uploads the gradients to the parameter server
  3. Aggregate Gradients : after collecting the gradients sent by all computing nodes, the parameter server sums them into an aggregated gradient
  4. Model Update : the parameter server uses this aggregated gradient to update the model parameters held on the centralized server (one iteration is sketched below)
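
The training loop in the Code section below follows these four stages directly; condensed, one iteration looks like this (server and worker are the ParamServer and Worker instances defined later in this post):

for batch_idx, (data, target) in enumerate(train_loader):
    params = server.get_weights()                            # 1. Pull Weights
    worker.pull_weights(params)
    grads = worker.push_gradients(batch_idx, data, target)   # 2. Push Gradients
    server.update_model(grads)                               # 3. Aggregate Gradients + 4. Model Update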

As can be seen, both Pull Weights and Push Gradients involve communication. For Pull Weights, the parameter server sends the weights to all workers at the same time, a one-to-many pattern called fan-out communication. Assume that each node (parameter server and worker) has a communication bandwidth of 1, and that there are N workers in this data-parallel training job. Since the centralized parameter server must send the model to N workers simultaneously, the send bandwidth available to each worker is only 1/N, while each worker's receive bandwidth is 1, much larger than the 1/N the parameter server can supply. Therefore, during the pull-weights stage, there is a communication bottleneck on the parameter server side.

For Push Gradients, all workers send their gradients to the parameter server concurrently, a many-to-one pattern called fan-in communication, so the parameter server is again the communication bottleneck.
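
As a back-of-the-envelope illustration (the numbers below are assumed, not from the original post):

N = 8                              # assumed number of workers
server_bw = 1.0                    # normalized bandwidth of the parameter server
per_worker_share = server_bw / N   # share of the server's bandwidth each worker gets

# Pull Weights (fan-out): each worker could receive at 1.0 but is fed at only 1/N.
# Push Gradients (fan-in): each worker could send at 1.0 but the server absorbs only 1/N per worker.
print(per_worker_share)            # 0.125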

Based on the above discussion, the communication bottleneck always occurs on the parameter server side; this problem can be addressed by load balancing across multiple parameter servers.

The model is partitioned across N parameter servers, each responsible for updating 1/N of the model parameters. In other words, the model parameters are sharded and stored on multiple parameter servers, which alleviates the network bottleneck on the parameter server side, reduces the communication load on each individual server, and improves overall communication efficiency.
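
As a minimal sketch of this idea (the helper shard_state_dict and the round-robin split are illustrative, not part of the original post), a model's state_dict can be partitioned across several parameter-server shards:

import torch.nn as nn

def shard_state_dict(model, num_shards):
    """Split a model's parameters round-robin across num_shards parameter servers."""
    shards = [dict() for _ in range(num_shards)]
    for i, (name, tensor) in enumerate(model.state_dict().items()):
        shards[i % num_shards][name] = tensor
    return shards

# Hypothetical usage: each shard holds roughly 1/N of the weights, so pull/push
# traffic is spread over N servers instead of concentrating on one.
shards = shard_state_dict(nn.Linear(9216, 128), num_shards=2)
print([list(s.keys()) for s in shards])   # e.g. [['weight'], ['bias']]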

Code

Define the network structure:

# network.py
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Put every layer on the GPU if one is available, otherwise on the CPU
        if torch.cuda.is_available():
            device = torch.device("cuda:0")
        else:
            device = torch.device("cpu")

        self.conv1 = nn.Conv2d(1, 32, 3, 1).to(device)
        self.dropout1 = nn.Dropout2d(0.5).to(device)
        self.conv2 = nn.Conv2d(32, 64, 3, 1).to(device)
        self.dropout2 = nn.Dropout2d(0.75).to(device)
        self.fc1 = nn.Linear(9216, 128).to(device)
        self.fc2 = nn.Linear(128, 20).to(device)
        self.fc3 = nn.Linear(20, 10).to(device)

    def forward(self, x):
        x = self.conv1(x)
        x = self.dropout1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.dropout2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)   # 64 * 12 * 12 = 9216 features for a 28x28 MNIST input

        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)

        output = F.log_softmax(x, dim=1)

        return output

A simple CNN is defined above

Implement a parameter server:

# server.py
import torch
import torch.nn as nn
import torch.optim as optim

from network import Net

class ParamServer(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Net()

        if torch.cuda.is_available():
            self.input_device = torch.device("cuda:0")
        else:
            self.input_device = torch.device("cpu")

        self.optimizer = optim.SGD(self.model.parameters(), lr=0.5)

    def get_weights(self):
        # Pull Weights: hand the current model parameters to the workers
        return self.model.state_dict()

    def update_model(self, grads):
        # Model Update: install the gradients received from the workers,
        # then take one SGD step
        for para, grad in zip(self.model.parameters(), grads):
            para.grad = grad

        self.optimizer.step()
        self.optimizer.zero_grad()

get_weights returns the current weight parameters, and update_model applies the received gradients to the model using the SGD optimizer

Implement workers:

# worker.py
import torch
import torch.nn as nn
import torch.nn.functional as F

from network import Net

class Worker(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Net()
        if torch.cuda.is_available():
            self.input_device = torch.device("cuda:0")
        else:
            self.input_device = torch.device("cpu")

    def pull_weights(self, model_params):
        # Pull Weights: overwrite the local copy with the parameters from the server
        self.model.load_state_dict(model_params)

    def push_gradients(self, batch_idx, data, target):
        # Push Gradients: run one local training step and return the gradients
        data, target = data.to(self.input_device), target.to(self.input_device)
        self.model.zero_grad()   # clear gradients left over from the previous batch
        output = self.model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        grads = []

        for layer in self.model.parameters():
            grad = layer.grad
            grads.append(grad)

        print(f"batch {batch_idx} training :: loss {loss.item()}")

        return grads

pull_weights loads the model parameters received from the server, and push_gradients trains on one batch and uploads the resulting gradients

Training

The training data set is MNIST

import torch
from torchvision import datasets,transforms
 
from network import Net
from worker import *
from server import *
 
train_loader = torch.utils.data.DataLoader(datasets.MNIST('./mnist_data', download=True, train=True,
               transform = transforms.Compose([transforms.ToTensor(),
               transforms.Normalize((0.1307,),(0.3081,))])),
               batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.MNIST('./mnist_data', download=True, train=False,
              transform = transforms.Compose([transforms.ToTensor(),
              transforms.Normalize((0.1307,),(0.3081,))])),
              batch_size=128, shuffle=True)
 
def main():
    server = ParamServer()
    worker = Worker()

    for batch_idx, (data, target) in enumerate(train_loader):
        params = server.get_weights()                            # 1. Pull Weights
        worker.pull_weights(params)
        grads = worker.push_gradients(batch_idx, data, target)   # 2. Push Gradients
        server.update_model(grads)                               # 3. Aggregate Gradients + 4. Model Update

    print("Done Training")
 
if __name__ == "__main__":
    main()

Source: Distributed Machine Learning (Parameter Server) - N3ptune - Blog Park (cnblogs.com)
