Technical Deep Dive | Understand the principles of differential privacy in one article!

With the rapid development of the Internet, it has become woven into every aspect of daily life, and our personal privacy is hardly a secret in the Internet age. How can you protect your privacy in the data era? What is differential privacy? This article walks you through what differential privacy is, the technical principles behind it, and how to implement differential privacy in MindSpore.

If anything is still unclear, tune in this weekend at 8 pm (2020.6.7 20:00): the MindSpore Douyin live room (Douyin account: MindSpore Gradient Forest Club) will explain it in detail!

Background on Differential Privacy

In the 1990s, a famous privacy breach occurred in Massachusetts, USA. The state's Group Insurance Commission (GIC) released "anonymized" medical data for public medical research. Before release, all directly identifying personal information, such as ID numbers, names, and addresses, was removed to prevent privacy leaks. However, in 1997, Dr. Latanya Sweeney of Carnegie Mellon University de-anonymized the data by linking the GIC database (which still contained each patient's birth date, gender, and ZIP code) with voter registration records, and identified the medical records of then-Massachusetts Governor William Weld.

Roughly two decades later, in 2018, a string of privacy breaches made headlines: Facebook faced a fine of up to 1.6 billion US dollars over leaked user data, YTO Express had 1 billion shipping records leaked, Marriott exposed the reservation information of 500 million guests, Huazhu Hotels Group was suspected of leaking data on roughly 500 million users, Cathay Pacific lost the data of 9.4 million passengers, and so on. Privacy breaches keep emerging one after another, and privacy protection has become a top priority.

Purpose of Privacy Protection

We hope that, after applying privacy protection technology, data can be released safely and is difficult for attackers to de-anonymize, while the overall information and research value of the original data are preserved as much as possible. Current research focuses on two questions:

  1. What guarantees a privacy protection technique can provide, and what kinds of attacks it can resist;

  2. How to preserve as much of the useful information in the original data as possible while protecting privacy.

Basic concepts of differential privacy

Differential privacy is a privacy definition proposed by Dwork in 2006 to address privacy leakage from statistical databases. Its goal is to make query results insensitive to changes in any individual record in the dataset. Put simply, whether a single record is in the dataset or not has very little effect on the query results, so an attacker cannot infer information about an individual by adding or removing one record and observing how the results change.

For example, without differential privacy: we query hospital A's database for the diagnoses of the 100 patients seen today and get back 10 lung cancer cases; querying the same data for 99 of those patients returns 9 cases. We can then deduce that the remaining patient, Zhang San, has lung cancer, exposing his personal privacy. With differential privacy, querying the 100 patients returns a lung cancer rate of 9.80%, while querying 99 patients returns 9.81%, so it is impossible to infer whether the remaining patient, Zhang San, has lung cancer.

In machine learning, algorithms typically consume large amounts of data and update model parameters to learn the data's features. Ideally, these algorithms learn models that generalize well. However, machine learning algorithms do not distinguish between general features and individual features. When we use machine learning for an important task such as lung cancer diagnosis, the released model may inadvertently memorize individual characteristics of the training set, and a malicious attacker may be able to extract Zhang San's private information from the released model. Differential privacy is therefore needed to prevent machine learning models from leaking personal data.

Definition of Differential Privacy

Figure 1 Probability of the random algorithm on adjacent datasets
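
The figure in the original post illustrates the formal definition. For reference (a standard statement, not specific to MindSpore), a randomized algorithm M satisfies (ε, δ)-differential privacy if, for any two adjacent datasets D and D' that differ in a single record, and for any set of outputs S:

\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta

The smaller ε is, the closer the two output distributions must be and the less any single record can influence the result; δ allows a small probability of exceeding this bound, and setting δ = 0 recovers pure ε-differential privacy.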

Differential privacy has two important advantages:

  • Differential privacy assumes the attacker can obtain all record information except the target record; the sum of this information can be understood as the maximum background knowledge the attacker could possess. Under this strong assumption, differential privacy does not need to consider any particular background knowledge the attacker may hold.

  • Differential privacy rests on a rigorous mathematical definition and provides a quantifiable way to evaluate privacy. It is therefore widely regarded as a strict and robust privacy protection mechanism.

How to achieve differential privacy

Differential privacy sounds great, so how is it achieved? A very natural idea is to "add noise": differential privacy can be achieved by adding an appropriate amount of interference noise. Commonly used noise mechanisms include the Laplace mechanism and the exponential mechanism; the Laplace mechanism is used to protect numeric results, while the exponential mechanism protects discrete results.

So what counts as an appropriate amount of noise, and how do we measure it? The amount of noise to add depends on the dataset: values in an age dataset differ far less from one another than values in a salary dataset, so the amount of noise required is different. Sensitivity is the key factor in deciding how much noise to add.

Sensitivity

Sensitivity is the maximum effect that deleting any single record in the dataset can have on the query result. Differential privacy distinguishes two kinds of sensitivity: global sensitivity and local sensitivity.
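
As a minimal sketch of how sensitivity and the privacy budget combine (illustrative NumPy code, not the MindSpore/MindArmour implementation; the function and query are invented for the example), the Laplace mechanism adds noise drawn from Laplace(0, Δf/ε), where Δf is the global sensitivity and ε is the privacy budget:

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return a differentially private version of a numeric query result.

    Noise is drawn from Laplace(0, sensitivity / epsilon): the larger the
    sensitivity or the smaller the budget epsilon, the more noise is added.
    """
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# A counting query ("how many patients have lung cancer?") has global
# sensitivity 1, because adding or removing one record changes the count
# by at most 1.
noisy_count = laplace_mechanism(10, sensitivity=1.0, epsilon=0.5)
print(noisy_count)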

Applications of Differential Privacy

Feeling dizzy after the math above? In simple terms, the essence of differential privacy is adding noise, and any algorithm that needs privacy protection can use it. Thanks to the composition properties of differential privacy (sequential and parallel composition), as long as every step of an algorithm satisfies differential privacy, the final output of the algorithm also satisfies differential privacy. Differential privacy can therefore be applied at any step of an algorithm's workflow.
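
For reference, the composition properties can be stated as follows: running k mechanisms with budgets ε_1, ..., ε_k on the same data consumes the sum of the budgets (sequential composition), while running them on disjoint subsets of the data consumes only the maximum (parallel composition):

\varepsilon_{\text{sequential}} = \sum_{i=1}^{k} \varepsilon_i, \qquad \varepsilon_{\text{parallel}} = \max_{i} \varepsilon_i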

In fact, differential privacy was proposed back in 2006, but what really made it famous was Craig Federighi, Apple's vice president of software engineering, announcing at WWDC 2016 that Apple uses local differential privacy to protect the privacy of iOS and macOS users. Differential privacy has since been successfully deployed in multiple scenarios, improving user experience while protecting user privacy.

For example, Apple uses differential privacy to collect statistics on emoji usage across different locales, improving QuickType's emoji predictions; it learns new words and foreign words from users' keyboard input and updates the on-device dictionary to improve the typing experience. It also uses differential privacy to identify domains that frequently cause high memory usage or high energy consumption in Safari, so that iOS and macOS High Sierra can allocate more resources when loading those sites and improve the browsing experience. In addition, Google uses local differential privacy to collect more than 14 million user behavior statistics from the Chrome browser every day.

Beyond these industrial applications, academic research on differential privacy is even broader. Differential privacy now appears in scenarios such as recommender systems, social network analysis, knowledge transfer, and federated learning.

Differential Privacy Implementation in MindSpore

MindArmour's differential privacy module, Differential-Privacy, implements a differential privacy optimizer. It currently supports SGD and Momentum optimizers based on the Gaussian mechanism, and also provides RDP (Rényi Differential Privacy) for monitoring the differential privacy budget.
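
The core idea of a Gaussian-mechanism differential privacy optimizer (as in DP-SGD) is to clip each per-sample gradient to an L2 norm bound and add Gaussian noise before the parameter update. The sketch below is a simplified NumPy illustration of that idea, not MindArmour's actual implementation; the function and parameter names are invented for the example:

import numpy as np

def dp_sgd_step(per_sample_grads, params, lr, l2_norm_bound, noise_multiplier):
    """One simplified DP-SGD update over a micro-batch of per-sample gradients.

    1. Clip each per-sample gradient so its L2 norm is at most l2_norm_bound.
    2. Sum the clipped gradients and add Gaussian noise with standard
       deviation noise_multiplier * l2_norm_bound.
    3. Average the noisy sum and apply an ordinary SGD update.
    """
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, l2_norm_bound / (norm + 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + np.random.normal(
        0.0, noise_multiplier * l2_norm_bound, size=params.shape)
    return params - lr * noisy_sum / len(per_sample_grads)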

Here we take the LeNet model as an example to illustrate how to use the differential privacy optimizer to train a neural network model on MindSpore.

This example targets CPU, GPU, and Ascend 910 AI processors. You can download the complete sample code here: <https://gitee.com/mindspore/mindarmour/blob/master/example/mnist_demo/lenet5_dp_model_train.py>

Import required library files

The following are the common modules we need and the modules of MindSpore.

import os
import argparse
import mindspore.nn as nn
from mindspore import context
from mindspore.train.callback import ModelCheckpoint
from mindspore.train.callback import CheckpointConfig
from mindspore.train.callback import LossMonitor
from mindspore.nn.metrics import Accuracy
from mindspore.train.serialization import load_checkpoint, load_param_into_net
import mindspore.dataset as ds
import mindspore.dataset.transforms.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.transforms.vision import Inter
import mindspore.common.dtype as mstype
from mindarmour.diff_privacy import DPModel
from mindarmour.diff_privacy import DPOptimizerClassFactory
from mindarmour.diff_privacy import PrivacyMonitorFactory
from mindarmour.utils.logger import LogUtil
from lenet5_net import LeNet5
from lenet5_config import mnist_cfg as cfg

Configure environment information

1. Use the argparse module to pass in information needed at run time, such as the execution environment and the dataset path. The advantage is that frequently changing configuration can be supplied on the command line, making the code more flexible to use.

parser = argparse.ArgumentParser(description='MindSpore MNIST Example')
parser.add_argument('--device_target', type=str, default="Ascend", choices=['Ascend', 'GPU', 'CPU'],
                    help='device where the code will be implemented (default: Ascend)')
parser.add_argument('--data_path', type=str, default="./MNIST_unzip",
                    help='path where the dataset is saved')
parser.add_argument('--dataset_sink_mode', type=bool, default=False, help='dataset_sink_mode is False or True')
parser.add_argument('--micro_batches', type=int, default=None,
                    help='optional, if use differential privacy, need to set micro_batches')
parser.add_argument('--l2_norm_bound', type=float, default=1,
                    help='optional, if use differential privacy, need to set l2_norm_bound')
parser.add_argument('--initial_noise_multiplier', type=float, default=0.001,
                    help='optional, if use differential privacy, need to set initial_noise_multiplier')
args = parser.parse_args()

2. Configure the necessary information, including environment information, execution mode, backend information and hardware information.

context.set_context(mode=context.PYNATIVE_MODE, device_target=args.device_target, enable_mem_reuse=False)

Preprocess the dataset

Load the dataset and process it into MindSpore data format.

def generate_mnist_dataset(data_path, batch_size=32, repeat_size=1, num_parallel_workers=1, sparse=True):
    """
    create dataset for training or testing
    """
    # define dataset
    ds1 = ds.MnistDataset(data_path)

    # define operation parameters
    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0

    # define map operations
    resize_op = CV.Resize((resize_height, resize_width),
                          interpolation=Inter.LINEAR)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply map operations on images
    if not sparse:
        one_hot_enco = C.OneHot(10)
        ds1 = ds1.map(input_columns="label", operations=one_hot_enco,
                      num_parallel_workers=num_parallel_workers)
        type_cast_op = C.TypeCast(mstype.float32)
    ds1 = ds1.map(input_columns="label", operations=type_cast_op,
                  num_parallel_workers=num_parallel_workers)
    ds1 = ds1.map(input_columns="image", operations=resize_op,
                  num_parallel_workers=num_parallel_workers)
    ds1 = ds1.map(input_columns="image", operations=rescale_op,
                  num_parallel_workers=num_parallel_workers)
    ds1 = ds1.map(input_columns="image", operations=hwc2chw_op,
                  num_parallel_workers=num_parallel_workers)

    # apply DatasetOps
    buffer_size = 10000
    ds1 = ds1.shuffle(buffer_size=buffer_size)
    ds1 = ds1.batch(batch_size, drop_remainder=True)
    ds1 = ds1.repeat(repeat_size)

    return ds1

Modeling

Here we take the LeNet model as an example, and you can also build and train your own model according to your needs.

from mindspore import nn
from mindspore.common.initializer import TruncatedNormal

def conv(in_channels, out_channels, kernel_size, stride=1, padding=0):
    weight = weight_variable()
    return nn.Conv2d(in_channels, out_channels,
                     kernel_size=kernel_size, stride=stride, padding=padding,
                     weight_init=weight, has_bias=False, pad_mode="valid")
def fc_with_initialize(input_channels, out_channels):
    weight = weight_variable()
    bias = weight_variable()
    return nn.Dense(input_channels, out_channels, weight, bias)

def weight_variable():
    return TruncatedNormal(0.02)

class LeNet5(nn.Cell):
    """
    Lenet network
    """
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = conv(1, 6, 5)
        self.conv2 = conv(6, 16, 5)
        self.fc1 = fc_with_initialize(16*5*5, 120)
        self.fc2 = fc_with_initialize(120, 84)
        self.fc3 = fc_with_initialize(84, 10)
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

Load the LeNet network, define the loss function, configure the checkpoint, and load the data with the data loading function generate_mnist_dataset defined above.

network = LeNet5()
net_loss = nn.SoftmaxCrossEntropyWithLogits(is_grad=False, sparse=True, reduction="mean")
config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet",
                             directory='./trained_ckpt_file/',
                             config=config_ck)

ds_train = generate_mnist_dataset(os.path.join(args.data_path, "train"),
                                  cfg.batch_size,
                                  cfg.epoch_size)

Introduce differential privacy

1. Configure the parameters of the differential privacy optimizer.

  • Set the dataset batch size.

  • Instantiate the differential privacy factory class.

  • Set the noise mechanism for differential privacy. Currently a Gaussian noise mechanism with a fixed standard deviation (Gaussian) and an adaptive Gaussian mechanism whose standard deviation adjusts automatically (AdaGaussian) are supported.

  • Set the optimizer type, currently supports SGD and Momentum.

  • Set the differential privacy budget monitor RDP to observe the change of the differential privacy budget in each step.

gaussian_mech = DPOptimizerClassFactory(args.micro_batches)
gaussian_mech.set_mechanisms('Gaussian',
                             norm_bound=args.l2_norm_bound,
                             initial_noise_multiplier=args.initial_noise_multiplier)
net_opt = gaussian_mech.create('Momentum')(params=network.trainable_params(),
                                           learning_rate=cfg.lr,
                                           momentum=cfg.momentum)
rdp_monitor = PrivacyMonitorFactory.create('rdp',
                                           num_samples=60000,
                                           batch_size=16,
                                           initial_noise_multiplier=5,
                                           target_delta=0.5,
                                           per_print_times=10)

2. Wrap the LeNet model as a differential privacy model by simply passing the network into DPModel.

model = DPModel(micro_batches=args.micro_batches,
                norm_clip=args.l2_norm_bound,
                dp_mech=gaussian_mech.mech,
                network=network,
                loss_fn=net_loss,
                optimizer=net_opt,
                metrics={"Accuracy": Accuracy()})

3. Model training and testing.

LOGGER = LogUtil.get_instance()  # MindArmour logger (defined here for completeness)
TAG = 'Lenet5_train'
LOGGER.info(TAG, "============== Starting Training ==============")
model.train(cfg['epoch_size'], ds_train, callbacks=[ckpoint_cb, LossMonitor(), rdp_monitor],
            dataset_sink_mode=args.dataset_sink_mode)

LOGGER.info(TAG, "============== Starting Testing ==============")
ckpt_file_name = 'trained_ckpt_file/checkpoint_lenet-10_1875.ckpt'
param_dict = load_checkpoint(ckpt_file_name)
load_param_into_net(network, param_dict)
ds_eval = generate_mnist_dataset(os.path.join(args.data_path, 'test'), batch_size=cfg.batch_size)
acc = model.eval(ds_eval, dataset_sink_mode=False)
LOGGER.info(TAG, "============== Accuracy: %s  ==============", acc)

4. Results display

The accuracy of the LeNet model without differential privacy is stable at about 99%; the LeNet model with adaptive differential privacy (AdaGaussian) converges with accuracy stable at about 96%; and the LeNet model with non-adaptive differential privacy (Gaussian) converges with accuracy stable at about 94%.

Figure 4 Comparison of training results
