Federated Learning: Implementing a Federated AlexNet Model on CIFAR-10 with TensorFlow

Table of contents

Client

Server side

Extensions

Client.py

Server.py

Dataset.py

Model.py


This article shares a way to implement federated learning with the following advantages:

There is no need to read and write files to save and switch Client models.
There is no need to re-initialize Client variables in each epoch.
The memory footprint is as small as possible (the number of parameters is only doubled, that is, Client + Server).
Switching Client only adds some assignment operations.

The goal of learning is a better model, which is kept by the Server and updated by the Clients; the data is kept by the Clients. The code environment and library dependencies of this article are:

Python 3.7
Tensorflow v1.14.x
tqdm (a Python module)


Next, this article explains the design and implementation of the client-side and server-side code. If you would rather skip the explanation, go straight to the complete code chapter at the end: there are four code files, and running python Server.py lets you immediately experience plain (single-machine, simulated) federated learning.

Client


First, let's clarify the tasks of the Client, which consist of the following three steps:

Load the model variables sent from the Server into the local model.
Update the current model with all of its own data.
Send the updated model variables back to the Server.

Given these tasks, we can list the functions the client code needs to provide:

Create and train the TensorFlow model (that is, the computation graph).
Load the model variable values sent from the Server.
Extract the variable values of the current model and send them to the Server.
Maintain its own dataset for training.

In fact, if you think about it carefully, this is just the usual TF model code plus extra functions for loading and extracting model variables. Assuming the Clients class has already built the model, sess.run() on each variable yields its value. The following code shows part of the definition of the Clients class; the get_client_vars function returns the values of all trainable variables in the computation graph:

class Clients:
    def __init__(self, input_shape, num_classes, learning_rate, clients_num):
        self.graph = tf.Graph()
        self.sess = tf.Session(graph=self.graph)
        
        """ 本函数未完待续... """
        
        
    def get_client_vars(self):
        """ Return all of the variables list """
        with self.graph.as_default():
            client_vars = self.sess.run(tf.trainable_variables())
        return client_vars


Next, load the global_vars sent from the Server into the model variables. The core is the tf.Variable.load() function, which loads a tensor value into a model variable, for example:

variable.load(tensor, sess)


This assigns the value of tensor to variable (of type tf.Variable), where sess is the tf.Session.

If you want to load all the variable values of the entire model, you can use tf.trainable_variables() to obtain all trainable variables in the computation graph (a list). After making sure its order matches global_vars, you can implement it like this:

    def set_global_vars(self, global_vars):
        """ Assign all of the variables with global vars """
        with self.graph.as_default():
            all_vars = tf.trainable_variables()
            for variable, value in zip(all_vars, global_vars):
                variable.load(value, self.sess)


In addition, the Clients class also needs to define and train the model. Since that is not the point of implementing federation, the following code removes the function bodies and leaves only the interface definitions (the complete code is in the last chapter):

import tensorflow as tf
import numpy as np
from collections import namedtuple
import math

# Customized model definition function
from Model import AlexNet
# Customized data set class
from Dataset import Dataset

# The definition of the fed model: a namedtuple storing the model's computation nodes, in order:
# X, Y: the input and label placeholders
# DROP_RATE: the dropout rate placeholder
# train_op: the training operation
# loss_op: as the name suggests
# acc_op: as the name suggests
FedModel = namedtuple('FedModel', 'X Y DROP_RATE train_op loss_op acc_op')


class Clients:
    def __init__(self, input_shape, num_classes, learning_rate, clients_num):
        self.graph = tf.Graph()
        self.sess = tf.Session(graph=self.graph)

        # Call the create function to build the computational graph of AlexNet
        # `net` is a list, containing in turn the computing nodes required by FedModel in the model (see above)
        net = AlexNet(input_shape, num_classes, learning_rate, self.graph)
        self.model = FedModel(*net)

        # initialize
        with self.graph.as_default():
            self.sess.run(tf.global_variables_initializer())

        # Load Cifar-10 dataset
        # NOTE: len(self.dataset.train) == clients_num
        # Load the dataset. For the training set: `self.dataset.train[56]` can obtain the data set of client No. 56
        # `self.dataset.train[56].next_batch(32)` can obtain a batch of client No. 56, the size is 32
        # For the test set, all clients share a test set, therefore:
        # `self.dataset.test.next_batch(1000)` will obtain a data set of size 1000 (no randomization)
        self.dataset = Dataset(tf.keras.datasets.cifar10.load_data,
                        split=clients_num)

    def run_test(self, num):
        """
            Predict the testing set, and report the acc and loss
        
            num: number of testing instances
        """
        pass

    def train_epoch(self, cid, batch_size=32, dropout_rate=0.5):
        """
            Train one client with its own data for one epoch
            cid: Client id
        """
        pass
        
    def choose_clients(self, ratio=1.0):
        """
            Randomly select a `ratio` fraction of clients and return their ids (indices).
        """
        client_num = self.get_clients_num()
        choose_num = math.ceil(client_num * ratio)
        return np.random.permutation(client_num)[:choose_num]
    
    def get_clients_num(self):
        """ Returns the number of clients"""
        return len(self.dataset.train)


Careful readers may have noticed that the class name Clients is plural, indicating a collection of Clients, yet there is only one model, self.model. The reason is that the models of the different Clients are actually identical; only their data differs, and the class member self.dataset has already partitioned that data. When a different client needs to participate in training, simply overwrite the model variables with the values given by the Server, then use the index cid to look up that Client's data for training.

Of course, the most important reason for this design is to avoid building that many Client computation graphs: we don't have that much GPU memory. To summarize: a federated learning client is just ordinary TF training code, plus functions for extracting and assigning the values of the model variables.
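
To make this concrete, here is a short usage sketch (assuming client is an instance of the Clients class above; the client ids 56 and 57 are arbitrary examples, not part of the original code). The same graph is reused for every client; only the variable values and the data shard change.

client.set_global_vars(global_vars)    # load the server-side model into the graph
client.train_epoch(cid=56)             # train on client 56's data shard
vars_56 = client.get_client_vars()     # extract its updated weights

client.set_global_vars(global_vars)    # switching to client 57 is just another assignment
client.train_epoch(cid=57)
vars_57 = client.get_client_vars()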

Server side


Following the same routine, let's first clarify the main tasks of the server-side code:

Use the Clients: push a set of model variables to a Client and collect the updated variable values.
Manage the global model: in each round of updates, collect the updated models from multiple Clients and aggregate them into the new global model.

For simplicity, the server-side code is not abstracted into a class but written as a script. First, instantiate the Clients class defined above:

from Client import Clients

def buildClients(num):
    learning_rate = 0.0001
    num_input = 32  # image shape: 32*32
    num_input_channel = 3  # image channel: 3
    num_classes = 10  # Cifar-10 total classes (0-9 digits)

    #create Client and model
    return Clients(input_shape=[None, num_input, num_input, num_input_channel],
                  num_classes=num_classes,
                  learning_rate=learning_rate,
                  clients_num=num)

CLIENT_NUMBER = 100
client = buildClients(CLIENT_NUMBER)
global_vars = client.get_client_vars()


The client variable stores the models (actually just one computation graph) and the data of CLIENT_NUMBER Clients. global_vars stores the server-side model variable values, which is our training target; at this point it only holds the initial values of the client-side model.

Next, in each epoch of the Server, the Server randomly selects a certain proportion of Clients to participate in this round of training, gives them the current server-side model global_vars to update, and collects each of their updated variables. Once all Clients participating in this round have been collected, the updated variable values are averaged to obtain the new server-side model, and the next epoch begins. Below is the code for the epoch update loop; please read the comments carefully:

def run_global_test(client, global_vars, test_num):
    """ Run the test set and output ACC and Loss """
    client.set_global_vars(global_vars)
    acc, loss = client.run_test(test_num)
    print("[epoch {}, { } inst] Testing ACC: {:.4f}, Loss: {:.4f}".format(
        ep + 1, test_num, acc, loss))


CLIENT_RATIO_PER_ROUND = 0.12 # The proportion of clients selected and run in each round
epoch = 360 # The upper limit of epoch

for ep in range(epoch):
    # We are going to sum up active clients' vars at each epoch
    # Used to collect Clients-side parameters and add them all up (to save memory)
    client_vars_sum = None

    # Choose some clients that will train on this epoch
    # Randomly select some Clients for training
    random_clients = client.choose_clients(CLIENT_RATIO_PER_ROUND)

    # Train with these clients
    # Use these Clients to train and collect their updated models
    for client_id in tqdm(random_clients, ascii=True):
        # Restore global vars to client's model
        # Load the server-side model into the Client model
        client.set_global_vars(global_vars)

        # train one client
        # Train the Client of this subscript
        client.train_epoch(cid=client_id)

        # obtain current client's vars
        # Obtain the model variable value of the current Client
        current_client_vars = client.get_client_vars()

        # sum it up
        # Superpose the parameters of each layer
        if client_vars_sum is None:
            client_vars_sum = current_client_vars
        else:
            for cv, ccv in zip(client_vars_sum, current_client_vars):
                cv += ccv

    # obtain the avg vars as global vars
    # Divide the superimposed client model variables by the number of Clients participating in this round of training
    # Get the average model as the new round of server model parameters
    global_vars = []
    for var in client_vars_sum:
        global_vars.append(var / len(random_clients))

    # run test on 600 instances
    # Run the test set and report the results
    run_global_test(client, global_vars, test_num=600)


After a number of rounds of iteration, we obtain the trained server-side model parameters global_vars. Although the logic is very simple, I hope readers notice two federated properties: the server-side code never touches the data, and the number of Clients participating in each round is small relative to the total.

Extensions


If you want to change the model, you only need to implement a new computation-graph constructor and use it in place of the AlexNet function on the client side, making sure it returns the same series of computation nodes.
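
For example, here is a minimal sketch of such a constructor (this small MLP is purely illustrative and not part of the original repository; it is meant to live in Model.py, where tf is already imported, and it assumes the 32x32x3 CIFAR-10 input shape used above). It accepts the same arguments and returns the same six nodes that FedModel expects.

def SimpleMLP(input_shape, num_classes, learning_rate, graph):
    """ A hypothetical drop-in replacement for AlexNet with the same interface. """
    with graph.as_default():
        X = tf.placeholder(tf.float32, input_shape, name='X')
        Y = tf.placeholder(tf.float32, [None, num_classes], name='Y')
        # kept only for interface compatibility; this model ignores dropout
        DROP_RATE = tf.placeholder(tf.float32, name='drop_rate')

        flat = tf.reshape(X, [-1, 32 * 32 * 3])  # assumes 32x32x3 CIFAR-10 images
        hidden = tf.layers.dense(flat, 256, activation=tf.nn.relu)
        logits = tf.layers.dense(hidden, num_classes)

        loss_op = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=Y))
        train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss_op)

        correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
        acc_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

        return X, Y, DROP_RATE, train_op, loss_op, acc_op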

If you want a Non-IID data distribution, you only need to modify how the data is partitioned in Dataset.py. However, in a small experiment I found that the current model and training scheme cannot cope with extreme Non-IID settings, which also confirms that Non-IID is indeed a hard problem in federated learning.
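
As a concrete example, one common way to obtain a Non-IID split is to sort the samples by class label before splitting, so that each client sees only a few classes. Here is a minimal sketch of such a method for the Dataset class (the name non_iid_split is hypothetical; it would be called in place of splited_batch, and y_data is already one-hot at that point):

    def non_iid_split(self, x_data, y_data, split):
        # sort by class label so that each shard covers only a few classes
        order = np.argsort(np.argmax(y_data, axis=1))
        x_data, y_data = x_data[order], y_data[order]

        res = []
        for x, y in zip(np.split(x_data, split), np.split(y_data, split)):
            res.append(BatchGenerator(x, y))
        return res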

If you want to transmit model gradients between Clients and the Server instead of model weights, you need to separate gradient computation from variable update on the Client side and insert the interaction with the Server in between; the exchanged content is the gradients. This may sound a bit abstract: many readers use Optimizer.minimize() without realizing that it is a combination of two other functions, compute_gradients() and apply_gradients(). The former computes the gradients; the latter applies them to the variables according to the learning rate. After computing the gradients, send them to the Server; the Server returns a globally averaged gradient, which is then applied to update the model. This is feasible, but it does not reduce the amount of data transmitted, and single-machine simulation becomes considerably harder to implement.
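
Here is a minimal sketch of the client-side graph changes (the names grad_ops and grad_placeholders are my own and purely illustrative):

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

# 1) compute (gradient, variable) pairs instead of calling minimize()
grads_and_vars = optimizer.compute_gradients(loss_op)
grad_ops = [g for g, _ in grads_and_vars]   # sess.run these to get the gradient values to send

# 2) placeholders for the globally averaged gradients returned by the Server
grad_placeholders = [tf.placeholder(tf.float32, v.get_shape())
                     for _, v in grads_and_vars]

# 3) apply the averaged gradients to the local variables
apply_op = optimizer.apply_gradients(
    [(ph, v) for ph, (_, v) in zip(grad_placeholders, grads_and_vars)])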

If you want a distributed deployment, wrap the Client-side code in a web backend service such as Flask and let the Server communicate with the Clients over the network. Note that when the Server initiates a request, the large number of parameters may cause problems; consider switching to a non-HTTP protocol.
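
A minimal sketch of such a wrapper (the /train endpoint and the JSON payload layout are assumptions for illustration; JSON is also not a compact format for large weight arrays):

from flask import Flask, request, jsonify
import numpy as np
from Client import Clients

app = Flask(__name__)
# in a distributed deployment, each machine holds one real client
client = Clients(input_shape=[None, 32, 32, 3], num_classes=10,
                 learning_rate=0.0001, clients_num=1)

@app.route('/train', methods=['POST'])
def train():
    # the Server POSTs the global variables as nested lists
    global_vars = [np.array(v, dtype=np.float32) for v in request.get_json()]
    client.set_global_vars(global_vars)
    client.train_epoch(cid=0)
    # return the updated variables; a binary format would be more compact
    return jsonify([v.tolist() for v in client.get_client_vars()])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)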

Complete code
There are four code files in total, and they should be placed in the same file directory:

Client.py: client-side code, managing the model and the data
Server.py: server-side code, managing the Clients and the global model
Dataset.py: defines how the data is organized
Model.py: defines the computation graph of the TF model

I have also uploaded them to GitHub; repository link: https://github.com/Zing22/tf-fed-demo.

Their complete code is posted below, with only the few comments I wrote while coding; more detailed comments have been added in the walkthrough above. Running it is very simple:

python Server.py


Client.py


import tensorflow as tf
import numpy as np
from collections import namedtuple
import math

from Model import AlexNet
from Dataset import Dataset

# The definition of fed model
FedModel = namedtuple('FedModel', 'X Y DROP_RATE train_op loss_op acc_op')

class Clients:
    def __init__(self, input_shape, num_classes, learning_rate, clients_num):
        self.graph = tf.Graph()
        self.sess = tf.Session(graph=self.graph)

        # Call the create function to build the computational graph of AlexNet
        net = AlexNet(input_shape, num_classes, learning_rate, self.graph)
        self.model = FedModel(*net)

        # initialize
        with self.graph.as_default():
            self.sess.run(tf.global_variables_initializer())

        # Load Cifar-10 dataset
        # NOTE: len(self.dataset.train) == clients_num
        self.dataset = Dataset(tf.keras.datasets.cifar10.load_data,
                        split=clients_num)

    def run_test(self, num):
        with self.graph.as_default():
            batch_x, batch_y = self.dataset.test.next_batch(num)
            feed_dict = {
                self.model.X: batch_x,
                self.model.Y: batch_y,
                self.model.DROP_RATE: 0
            }
        return self.sess.run([self.model.acc_op, self.model.loss_op],
                             feed_dict=feed_dict)

    def train_epoch(self, cid, batch_size=32, dropout_rate=0.5):
        """
            Train one client with its own data for one epoch
            cid: Client id
        """
        dataset = self.dataset.train[cid]

        with self.graph.as_default():
            for _ in range(math.ceil(dataset.size / batch_size)):
                batch_x, batch_y = dataset.next_batch(batch_size)
                feed_dict = {
                    self.model.X: batch_x,
                    self.model.Y: batch_y,
                    self.model.DROP_RATE: dropout_rate
                }
                self.sess.run(self.model.train_op, feed_dict=feed_dict)

    def get_client_vars(self):
        """ Return all of the variables list """
        with self.graph.as_default():
            client_vars = self.sess.run(tf.trainable_variables())
        return client_vars

    def set_global_vars(self, global_vars):
        """ Assign all of the variables with global vars """
        with self.graph.as_default():
            all_vars = tf.trainable_variables()
            for variable, value in zip(all_vars, global_vars):
                variable.load(value, self.sess)

    def choose_clients(self, ratio=1.0):
        """ randomly choose some clients """
        client_num = self.get_clients_num()
        choose_num = math.ceil(client_num * ratio)
        return np.random.permutation(client_num)[:choose_num]

    def get_clients_num(self):
        return len(self.dataset.train)


Server.py


import tensorflow as tf
from tqdm import tqdm

from Client import Clients

def buildClients(num):
    learning_rate = 0.0001
    num_input = 32  # image shape: 32*32
    num_input_channel = 3  # image channel: 3
    num_classes = 10  # Cifar-10 total classes (0-9 digits)

    #create Client and model
    return Clients(input_shape=[None, num_input, num_input, num_input_channel],
                  num_classes=num_classes,
                  learning_rate=learning_rate,
                  clients_num=num)


def run_global_test(client, global_vars, test_num):
    client.set_global_vars(global_vars)
    acc, loss = client.run_test(test_num)
    print("[epoch {}, {} inst] Testing ACC: {:.4f}, Loss: {:.4f}".format(
        ep + 1, test_num, acc, loss))


#### SOME TRAINING PARAMS ####
CLIENT_NUMBER = 100
CLIENT_RATIO_PER_ROUND = 0.12
epoch = 360


#### CREATE CLIENT AND LOAD DATASET ####
client = buildClients(CLIENT_NUMBER)

#### BEGIN TRAINING ####
global_vars = client.get_client_vars()
for ep in range(epoch):
    # We are going to sum up active clients' vars at each epoch
    client_vars_sum = None

    # Choose some clients that will train on this epoch
    random_clients = client.choose_clients(CLIENT_RATIO_PER_ROUND)

    # Train with these clients
    for client_id in tqdm(random_clients, ascii=True):
        # Restore global vars to client's model
        client.set_global_vars(global_vars)

        # train one client
        client.train_epoch(cid=client_id)

        # obtain current client's vars
        current_client_vars = client.get_client_vars()

        # sum it up
        if client_vars_sum is None:
            client_vars_sum = current_client_vars
        else:
            for cv, ccv in zip(client_vars_sum, current_client_vars):
                cv += ccv

    # obtain the avg vars as global vars
    global_vars = []
    for var in client_vars_sum:
        global_vars.append(var / len(random_clients))

    # run test on 600 instances
    run_global_test(client, global_vars, test_num=600)


#### FINAL TEST ####
run_global_test(client, global_vars, test_num=10000)


Dataset.py


import numpy as np
from tensorflow.keras.utils import to_categorical


class BatchGenerator:
    def __init__(self, x, yy):
        self.x = x
        self.y = yy
        self.size = len(x)
        self.random_order = list(range(len(x)))
        np.random.shuffle(self.random_order)
        self.start = 0
        return

    def next_batch(self, batch_size):
        perm = self.random_order[self.start:self.start + batch_size]

        self.start += batch_size
        if self.start > self.size:
            self.start = 0

        return self.x[perm], self.y[perm]

    # support slice
    def __getitem__(self, val):
        return self.x[val], self.y[val]


class Dataset(object):
    def __init__(self, load_data_func, one_hot=True, split=0):
        (x_train, y_train), (x_test, y_test) = load_data_func()
        print("Dataset: train-%d, test-%d" % (len(x_train), len(x_test)))

        if one_hot:
            y_train = to_categorical(y_train, 10)
            y_test = to_categorical(y_test, 10)

        x_train = x_train.astype('float32') / 255
        x_test = x_test.astype('float32') / 255

        if split == 0:
            self.train = BatchGenerator(x_train, y_train)
        else:
            self.train = self.splited_batch(x_train, y_train, split)

        self.test = BatchGenerator(x_test, y_test)

    def splited_batch(self, x_data, y_data, split):
        res = []
        for x, y in zip(np.split(x_data, split), np.split(y_data, split)):
            assert len(x) == len(y)
            res.append(BatchGenerator(x, y))
        return res


Model.py


import tensorflow as tf
import numpy as np
from tensorflow.compat.v1.train import AdamOptimizer

#### Create tf model for Client ####

def AlexNet(input_shape, num_classes, learning_rate, graph):
    """
        Construct the AlexNet model.
        input_shape: The shape of input (`list` like)
        num_classes: The number of output classes (`int`)
        learning_rate: learning rate for optimizer (`float`)
        graph: The tf computation graph (`tf.Graph`)
    """
    with graph.as_default():
        X = tf.placeholder(tf.float32, input_shape, name='X')
        Y = tf.placeholder(tf.float32, [None, num_classes], name='Y')
        DROP_RATE = tf.placeholder(tf.float32, name='drop_rate')

        # 1st Layer: Conv (w ReLu) -> Lrn -> Pool
        # conv1 = conv(X, 11, 11, 96, 4, 4, padding='VALID', name='conv1')
        conv1 = conv(X, 11, 11, 96, 2, 2, name='conv1')
        norm1 = lrn(conv1, 2, 2e-05, 0.75, name='norm1')
        pool1 = max_pool(norm1, 3, 3, 2, 2, padding='VALID', name='pool1')

        # 2nd Layer: Conv (w ReLu)  -> Lrn -> Pool with 2 groups
        conv2 = conv(pool1, 5, 5, 256, 1, 1, groups=2, name='conv2')
        norm2 = lrn(conv2, 2, 2e-05, 0.75, name='norm2')
        pool2 = max_pool(norm2, 3, 3, 2, 2, padding='VALID', name='pool2')

        # 3rd Layer: Conv (w ReLu)
        conv3 = conv(pool2, 3, 3, 384, 1, 1, name='conv3')

        # 4th Layer: Conv (w ReLu) splitted into two groups
        conv4 = conv(conv3, 3, 3, 384, 1, 1, groups=2, name='conv4')

        # 5th Layer: Conv (w ReLu) -> Pool splitted into two groups
        conv5 = conv(conv4, 3, 3, 256, 1, 1, groups=2, name='conv5')
        pool5 = max_pool(conv5, 3, 3, 2, 2, padding='VALID', name='pool5')

        # 6th Layer: Flatten -> FC (w ReLu) -> Dropout
        # flattened = tf.reshape(pool5, [-1, 6*6*256])
        # fc6 = fc(flattened, 6*6*256, 4096, name='fc6')

        flattened = tf.reshape(pool5, [-1, 1 * 1 * 256])
        fc6 = fc_layer(flattened, 1 * 1 * 256, 1024, name='fc6')
        dropout6 = dropout(fc6, DROP_RATE)

        # 7th Layer: FC (w ReLu) -> Dropout
        # fc7 = fc(dropout6, 4096, 4096, name='fc7')
        fc7 = fc_layer(dropout6, 1024, 2048, name='fc7')
        dropout7 = dropout(fc7, DROP_RATE)

        # 8th Layer: FC and return unscaled activations
        logits = fc_layer(dropout7, 2048, num_classes, relu=False, name='fc8')

        # loss and optimizer
        loss_op = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits,
                                                        labels=Y))
        optimizer = AdamOptimizer(
            learning_rate=learning_rate)
        train_op = optimizer.minimize(loss_op)

        # Evaluate model
        prediction = tf.nn.softmax(logits)
        pred = tf.argmax(prediction, 1)

        # accuracy
        correct_pred = tf.equal(pred, tf.argmax(Y, 1))
        accuracy = tf.reduce_mean(
            tf.cast(correct_pred, tf.float32))

        return X, Y, DROP_RATE, train_op, loss_op, accuracy


def conv(x, filter_height, filter_width, num_filters,
            stride_y, stride_x, name, padding='SAME', groups=1):
    """Create a convolution layer.

    Adapted from: https://github.com/ethereon/caffe-tensorflow
    """
    # Get number of input channels
    input_channels = int(x.get_shape()[-1])

    # Create lambda function for the convolution
    convolve = lambda i, k: tf.nn.conv2d(
        i, k, strides=[1, stride_y, stride_x, 1], padding=padding)

    with tf.variable_scope(name) as scope:
        # Create tf variables for the weights and biases of the conv layer
        weights = tf.get_variable('weights',
                                    shape=[
                                        filter_height, filter_width,
                                        input_channels / groups, num_filters
                                    ])
        biases = tf.get_variable('biases', shape=[num_filters])

    if groups == 1:
        conv = convolve(x, weights)

    # In the case of multiple groups, split inputs & weights and convolve them separately
    else:
        # Split input and weights and convolve them separately
        input_groups = tf.split(axis=3, num_or_size_splits=groups, value=x)
        weight_groups = tf.split(axis=3,
                                    num_or_size_splits=groups,
                                    value=weights)
        output_groups = [
            convolve(i, k) for i, k in zip(input_groups, weight_groups)
        ]

        # Concat the convolved output together again
        conv = tf.concat(axis=3, values=output_groups)

    # Add biases
    bias = tf.reshape(tf.nn.bias_add(conv, biases), tf.shape(conv))

    # Apply relu function
    relu = tf.nn.relu(bias, name=scope.name)

    return relu


def fc_layer(x, input_size, output_size, name, relu=True, k=20):
    """Create a fully connected layer."""

    with tf.variable_scope(name) as scope:
        # Create tf variables for the weights and biases.
        W = tf.get_variable('weights', shape=[input_size, output_size])
        b = tf.get_variable('biases', shape=[output_size])
        # Matrix multiply weights and inputs and add biases.
        z = tf.nn.bias_add(tf.matmul(x, W), b, name=scope.name)

    if relu:
        # Apply ReLu non linearity.
        a = tf.nn.relu(z)
        return a

    else:
        return z


def max_pool(x,
                filter_height, filter_width,
                stride_y, stride_x,
                name, padding='SAME'):
    """Create a max pooling layer."""
    return tf.nn.max_pool2d(x,
        ksize=[1, filter_height, filter_width, 1],
        strides=[1, stride_y, stride_x, 1],
        padding=padding,
        name=name)


def lrn(x, radius, alpha, beta, name, bias=1.0):
    """Create a local response normalization layer."""
    return tf.nn.local_response_normalization(x,
        depth_radius=radius,
        alpha=alpha,
        beta=beta,
        bias=bias,
        name=name)


def dropout(x, rate):
    """Create a dropout layer."""
    return tf.nn.dropout(x, rate=rate)

Original post: https://blog.csdn.net/qq_38998213/article/details/133325154