The Latest Research Trends in Federated Learning

2020-03-13 09:48:18

Federated learning took off in 2019. What does the latest research progress look like?

Text | Jiang Shang Bao

Editor | Jia Wei

 

Federated learning is undoubtedly one of the most popular technical paradigms in AI right now, and 2019 saw a large body of federated learning research emerge.

Federated learning is a machine learning framework that lets users train machine learning models on multiple datasets distributed across different locations, while preventing data leakage and complying with strict data privacy regulations.

It can prevent data leaks! This also means that federated learning may be an important way to work with sensitive data.

Recently, scholars from institutions including the Australian National University, Carnegie Mellon University, Cornell University, Google, and the Hong Kong University of Science and Technology jointly published a paper that elaborates on the open problems and challenges facing the field and lists a large number of valuable research directions.

Download paper:

https://arxiv.org/pdf/1912.04977.pdf

The survey consists of seven parts. Starting from an introduction, it covers federated learning settings and problems beyond the cross-device setting, how to improve the efficiency and effectiveness of federated learning, the privacy of user data, robustness against model manipulation and failures, and other hot topics.

 

1 Introduction

Federated learning is a machine learning setting in which many clients (such as mobile devices or entire organizations) collaboratively train a model under the orchestration of a central server (such as a service provider), while keeping the training data decentralized. By following the principles of focused data collection and data minimization, federated learning can reduce many of the systemic privacy risks and costs that come with traditional, centralized machine learning.

The term "federated learning" was introduced by McMahan et al. in 2016, but long before the term existed there was already a great deal of research dedicated to data privacy protection; for example, methods for computing on encrypted data appeared as early as the 1980s.

Federated learning initially focused on mobile and edge device applications. The researchers call the two main settings cross-device and cross-silo. Based on these two variants, the paper gives a broader definition of federated learning:

Federated learning is a machine learning setting in which multiple entities (clients) collaborate to solve a machine learning problem under the coordination of a central server or service provider. Each client's raw data is stored locally and is not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.

It is worth noting that this definition distinguishes federated learning from fully decentralized learning techniques.

Cross-device federated learning setting: the figure shows the lifecycle of a model trained with federated learning, as well as the various actors in a federated learning system. Specifically, the workflow has six parts: 1 problem identification; 2 client instrumentation; 3 simulation prototyping; 4 federated model training; 5 model evaluation; 6 deployment.

The training process itself consists of: 1 client selection; 2 broadcast; 3 client computation; 4 aggregation; 5 model update. In the client selection step, the server samples from the clients that meet the eligibility requirements. In the broadcast step, the selected clients download the current model weights and a training program from the server. Separating the client computation, aggregation, and model update phases is not a strict requirement of federated learning, but it does exclude certain classes of algorithms, such as asynchronous SGD.
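The five training steps can be sketched in a few lines of Python. This is only a minimal illustration, not code from the paper: the one-parameter linear model, the synthetic clients, and the names `local_sgd`, `federated_round`, and `make_client` are all assumptions made for the example, and the weighted averaging follows the commonly used Federated Averaging scheme.

```python
import random

def local_sgd(w, data, lr=0.1, epochs=5):
    """Client computation: a few epochs of SGD on the model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= lr * grad
    return w

def federated_round(global_w, clients, num_selected, rng):
    # 1. Client selection: sample from the eligible clients.
    selected = rng.sample(clients, num_selected)
    updates = []
    for data in selected:
        # 2. Broadcast: each selected client starts from the current global weight.
        # 3. Client computation: train locally; the raw data never leaves the client.
        local_w = local_sgd(global_w, data)
        updates.append((local_w - global_w, len(data)))
    # 4. Aggregation: average the deltas, weighted by local example count.
    total = sum(n for _, n in updates)
    avg_delta = sum(delta * n for delta, n in updates) / total
    # 5. Model update: the server applies the aggregated delta.
    return global_w + avg_delta

def make_client(rng, true_w=3.0, n=20):
    """Hypothetical client holding noisy samples of y = true_w * x."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0, 0.01)) for x in xs]

rng = random.Random(0)
clients = [make_client(rng) for _ in range(10)]
w = 0.0
for _ in range(20):
    w = federated_round(w, clients, num_selected=4, rng=rng)
print(w)  # close to the true weight used to generate the data
```

Only the weight deltas and example counts reach the server in this sketch; each client's `data` list is touched only inside `local_sgd`.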

 

2 Federated learning settings and problems beyond the cross-device setting

In federated training the server plays a central role, and when the number of clients is very large the server may become a training bottleneck. The key idea of full decentralization is to replace server-centric communication with peer-to-peer communication between clients.

In fully decentralized algorithms, the clients are the nodes and the communication channels between clients are the edges; together these nodes and edges form the federated learning network. Note that there is no longer a global model state as in standard federated learning. Instead, the process can be designed so that all local models converge to the desired global solution; in other words, the individual models gradually reach consensus.

Even though the setting is fully decentralized, there still has to be a central authority responsible for setting up the learning task, which includes choosing the algorithm, choosing the hyperparameters, debugging, and so on. This central authority needs to be trusted. The role may be played by the client that proposes the learning task, or it may be decided through a consensus scheme.
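To make the idea of "reaching consensus without a server" concrete, here is a small gossip-averaging sketch. It is an illustration under assumptions: each client's "model" is a single number, the network is a ring of six clients, and `gossip_round` is a hypothetical helper; real decentralized training would interleave local gradient steps with this kind of neighbor averaging.

```python
def gossip_round(values, neighbors):
    """Each node replaces its value with the mean of itself and its neighbors."""
    new_values = []
    for i, v in enumerate(values):
        group = [v] + [values[j] for j in neighbors[i]]
        new_values.append(sum(group) / len(group))
    return new_values

n = 6
# Ring topology: client i only talks to clients i-1 and i+1 (the graph edges).
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # one local scalar "model" per client
for _ in range(100):
    values = gossip_round(values, ring)

print(values)  # every entry is close to the global average
```

No node ever sees all six values, yet because every node uses the same symmetric averaging weights, the overall mean is preserved and all local values drift toward it.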

Comparison of federated learning and fully decentralized learning

However, current decentralized machine learning algorithms still face many problems. Some resemble the special case of federated learning with a central server, while others are side effects of full decentralization.

On the algorithmic side, the main challenges are: the effect of network topology and asynchrony on decentralized SGD, local-update decentralized SGD, personalization and trust mechanisms, and gradient compression and quantization methods.

Cross-silo federated learning: in contrast to the characteristics of cross-device federated learning, cross-silo federated learning is very flexible in certain aspects of its overall design. If a number of organizations only want to train a shared model and do not want to share their data, the cross-silo setting is a very good choice. The main topics in cross-silo federated learning are: data partitioning, incentive mechanisms, differential privacy, and tensor factorization.

Two split learning configurations

Split learning: the key idea is to split the execution of a model, layer by layer, between the clients and the server, for both training and inference. In the simplest configuration, each client computes the forward pass through a deep network up to a specific layer, called the cut layer. The outputs of the cut layer, called smashed data, are sent to another entity (the server or another client), which completes the rest of the computation. This completes a round of forward propagation without sharing the raw data. The gradients can then be backpropagated in a similar fashion from the last layer to the cut layer. The process continues until convergence.
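The message flow of the simplest split learning configuration can be sketched as follows. This is a toy illustration under assumptions: the "network" is just two scalar layers, the `Client` and `Server` classes are hypothetical, and labels are assumed to be available to the server in this configuration.

```python
class Client:
    """Holds the raw inputs and the layers up to the cut layer."""
    def __init__(self):
        self.w1 = 0.5

    def forward(self, x):
        return self.w1 * x  # cut-layer output ("smashed data") sent onward

    def backward(self, x, grad_smashed, lr):
        # Finish backpropagation locally using the gradient received at the cut layer.
        self.w1 -= lr * grad_smashed * x

class Server:
    """Holds the layers after the cut layer; never sees the raw input x."""
    def __init__(self):
        self.w2 = 0.5

    def train_step(self, smashed, y, lr):
        pred = self.w2 * smashed
        grad_pred = 2 * (pred - y)           # derivative of (pred - y)^2
        grad_smashed = grad_pred * self.w2   # gradient at the cut layer
        self.w2 -= lr * grad_pred * smashed  # update the server-side layer
        return grad_smashed                  # only this goes back to the client

data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (-1.0, -2.0)]  # samples of y = 2x
client, server = Client(), Server()
for _ in range(200):
    for x, y in data:
        smashed = client.forward(x)
        grad_smashed = server.train_step(smashed, y, lr=0.05)
        client.backward(x, grad_smashed, lr=0.05)

product = client.w1 * server.w2
print(product)  # the composed model w1 * w2 approaches 2
```

The only values crossing the client/server boundary are `smashed` and `grad_smashed`; the input `x` stays on the client.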

 

3 How to improve efficiency

This part of the paper is an open-ended chapter that explores a variety of techniques and open questions, including: How can better optimization algorithms be developed? How can different models be provided to different clients? How can machine learning tasks be carried out in the federated learning setting?

Solving these problems involves many challenges, one of which is the presence of non-IID data (data that is not independent and identically distributed). This arises in three main ways: 1 different clients have different distributions; 2 violations of the independence assumption; 3 dataset shift.
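The first source of non-IID data, different distributions on different clients, is easy to simulate. The sketch below is an assumption-laden illustration: it uses synthetic labels and two hypothetical helpers, `iid_partition` and `label_skew_partition`, where the latter uses the "sort by label, then cut into shards" trick often used in experiments to create label-skewed clients.

```python
import random

def iid_partition(labels, num_clients, rng):
    """Shuffle example indices and deal them out evenly: every client looks alike."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    return [idx[c::num_clients] for c in range(num_clients)]

def label_skew_partition(labels, num_clients):
    """Sort indices by label and cut into contiguous shards: few classes per client."""
    idx = sorted(range(len(labels)), key=lambda i: labels[i])
    shard = len(idx) // num_clients
    return [idx[c * shard:(c + 1) * shard] for c in range(num_clients)]

rng = random.Random(0)
labels = [i % 10 for i in range(1000)]  # synthetic dataset with 10 balanced classes

iid = iid_partition(labels, 5, rng)
skewed = label_skew_partition(labels, 5)

# Count how many distinct classes each client holds under each scheme.
iid_classes = [len({labels[i] for i in part}) for part in iid]
skewed_classes = [len({labels[i] for i in part}) for part in skewed]
print(iid_classes, skewed_classes)
```

Under the skewed partition each client holds only a small slice of the label space, so a model trained locally on one client can differ sharply from one trained on another.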

How should non-IID data be handled? The most common approach is to modify existing algorithms. For some applications, it is also possible to augment the data so that the data on different clients becomes more similar, for example by creating a small dataset that can be shared globally.

Another way to improve efficiency is to optimize the algorithms used for federated learning. In typical federated learning tasks, the optimization goal is to minimize a certain objective function. The main difference between federated optimization algorithms and standard distributed training methods is the need to handle non-IID and unbalanced data. Another important practical consideration in federated learning is whether an algorithm can be combined with other techniques, for example adapting stateful optimization algorithms (such as ADMM) and stateful compression strategies to the actual situation.

Multi-task learning, personalization, and meta-learning are very effective in the face of non-IID data, and their performance may even exceed that of the best shared global model. In addition, with personalization via featurization, where user and context features are given to the model as inputs, a shared global model can produce highly personalized predictions.

To make training more efficient and effective, the machine learning workflow itself can be adapted. Steps of the standard workflow, such as data augmentation, feature engineering, neural architecture design, model selection, and hyperparameter optimization, run into many problems when applied to decentralized datasets and resource-constrained mobile devices.
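The text leaves the objective function unspecified. As a point of reference only, and not necessarily the exact form used in the paper, one commonly used formulation is the Federated Averaging objective, where $K$ is the number of clients, $\mathcal{P}_k$ is the set of examples on client $k$, $n_k = |\mathcal{P}_k|$, $n = \sum_k n_k$, and $\ell_i(w)$ is the loss of model $w$ on example $i$ (all symbols are introduced here for illustration):

```latex
\min_{w} \; f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w),
\qquad
F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} \ell_i(w)
```

With IID data each local objective $F_k$ is a good approximation of $f$; with non-IID or unbalanced data the $F_k$ can differ substantially from $f$ and from each other, which is exactly what makes federated optimization harder than standard distributed training.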

 

4 Protecting the privacy of user data

 

Threat models

Machine learning workflows involve a variety of actors. Users, for example, may generate training data through interactions with their devices. Machine learning engineers participate by training the model and evaluating its quality.

In an ideal state, every participant in the system can easily work out which of their information is and is not disclosed, and all participants can use these inferences to decide whether to take action.

This chapter of the paper outlines existing results and explains the challenges of designing systems that provide rigorous privacy guarantees, as well as the open problems that federated learning systems face today. Of course, besides attacks on user privacy there are other kinds of attacks on federated learning; for example, an adversary may simply try to prevent the model from being trained, or try to bias the model.

The paper also discusses the various threat models against which protection can be provided, and then lists some core tools and technologies. Under the assumption of a trusted server, it further discusses the open problems and challenges in protecting against adversarial clients and analysts.

 

5 Robustness to attacks and failures

Modern machine learning systems are prone to many kinds of failures. These failures may be non-malicious, such as bugs in preprocessing pipelines, noisy training labels, or unreliable clients, or they may be explicit attacks on training and deployment. In this section, the paper describes how the distributed nature, architectural design, and data constraints of federated learning open up new failure modes and attack surfaces. It is also worth noting that the security mechanisms that protect privacy in federated learning can make these failures and attacks very difficult to detect and correct.

The paper also discusses the relationship between different types of attacks and failures, and the importance of these relationships in federated learning.

Adversarial attacks on model performance: an attacker may not only target the performance of the model, but may also be able to infer the private data of the users involved in training. There are many examples of adversarial attacks, including data poisoning, model update poisoning, and model evasion attacks.

Non-malicious failure modes: compared with traditional data center training, federated learning is particularly susceptible to non-malicious failures caused by unreliable clients. As with adversarial attacks, system factors and data constraints can also lead to non-malicious failures. Non-malicious failures are usually less destructive than malicious attacks but occur more frequently, and they often share common roots and complications with malicious attacks. Methods for handling non-malicious failures can therefore also be used against malicious attacks.

Exploring the tension between privacy and robustness: secure aggregation is often used to strengthen privacy protection, but it generally makes defending against adversarial attacks harder, because the central server only sees the aggregate of the client updates. Studying how to defend against adversarial attacks when secure aggregation is in use is therefore very important.

Overall, this part first introduces adversarial attacks, then discusses non-malicious failure modes, and finally explores the tension between privacy and robustness.
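A tiny numeric example shows why model update poisoning is a concern. The numbers below are made up, and the coordinate-wise median is brought in here only as one commonly studied robust aggregator for contrast; the text above does not say which defenses the paper recommends.

```python
from statistics import mean, median

def aggregate(updates, reducer):
    """Combine client update vectors coordinate by coordinate."""
    return [reducer(coord) for coord in zip(*updates)]

# Four honest clients roughly agree on the update direction (hypothetical values).
honest = [[1.0, -1.0], [1.1, -0.9], [0.9, -1.1], [1.0, -1.0]]
# One malicious client submits a large update pointing the opposite way.
poisoned = honest + [[-100.0, 100.0]]

mean_agg = aggregate(poisoned, mean)
median_agg = aggregate(poisoned, median)

print(mean_agg)    # dragged far away from the honest updates
print(median_agg)  # stays at the honest consensus
```

A single poisoned update is enough to flip the sign of the averaged update, while the median ignores it. Note that a server can only compute a median if it sees the individual updates.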
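The reason the server "only sees the aggregate" can be illustrated with pairwise masking, the basic idea behind many secure aggregation protocols. This is a simplified sketch under assumptions: updates are integer-encoded, the masks are drawn from Python's non-cryptographic `random` module purely for demonstration, and `masked_updates` is a hypothetical helper; real protocols derive the masks from key agreement between clients and must also handle client dropouts.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo a fixed value

def masked_updates(updates, rng):
    """Every pair of clients (i, j) shares a random mask: i adds it, j subtracts it."""
    n = len(updates)
    masked = list(updates)
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.randrange(MODULUS)  # stands in for a secret shared by i and j
            masked[i] = (masked[i] + mask) % MODULUS
            masked[j] = (masked[j] - mask) % MODULUS
    return masked

rng = random.Random(0)
updates = [5, 17, 42, 8]  # integer-encoded client updates (hypothetical)
masked = masked_updates(updates, rng)

# The server only receives the masked values; the masks cancel in the sum.
server_sum = sum(masked) % MODULUS
print(masked)
print(server_sum)
```

Each masked value on its own looks random, so the server cannot inspect individual updates for signs of poisoning. That is precisely the tension between privacy and robustness.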

 

6 Ensuring fairness and eliminating bias

Machine learning models often behave in surprising ways. When this behavior leads to patterns that are undesirable for users, researchers may classify the model as unfair. For example, if people with similar characteristics receive completely different outcomes, this violates the criterion of individual fairness. If certain sensitive groups (race, gender, etc.) receive different patterns of outcomes, this may violate various criteria of demographic fairness.

Federated learning offers several opportunities for fairness research, some of which extend earlier research in the non-federated setting, while others are unique to federated learning.

Bias in training data: one driver of unfairness in machine learning models is bias in the training data, including cognitive, sampling, reporting, and confirmation bias. A common pattern is that individuals with certain characteristics are underrepresented in the overall dataset, so the weights learned during training do not represent them well. Just as the data access process used in federated learning may introduce dataset shift and non-independence, it may also introduce bias.

Fairness without access to sensitive attributes: explicit access to demographic information such as race and gender is central to many existing fairness criteria. However, the environments in which federated learning is often deployed also raise fairness questions when individual sensitive attributes are not available, for example when developing personalized language models or fair medical classifiers. Measuring and correcting unfairness in these settings is therefore a key problem for federated learning researchers to solve.

Fairness, privacy, and robustness: fairness and data privacy appear to be complementary ethical concepts. In many real-world settings where privacy is needed, fairness is also highly desirable. Because federated learning is most likely to be deployed on sensitive data where both privacy and fairness matter, addressing fairness and privacy together is essential.

Using federation to improve model diversity: through distributed training, federated learning makes it possible to put to reasonable use data that previously might have been impractical or even illegal to combine. Some current data privacy laws already force businesses to build models within data silos. In addition, a lack of representation and diversity in the training data leads to degraded performance. Federated learning may be able to combine data that is correlated with sensitive attributes to improve the fairness of these models, and thereby their performance.
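One way to check a demographic fairness criterion is to compare an error rate across groups. The sketch below computes the false negative rate per group; the labels, predictions, group names, and the helpers `false_negative_rate` and `fnr_by_group` are all made up for illustration, and the false negative rate is just one of many possible criteria.

```python
def false_negative_rate(y_true, y_pred):
    """Fraction of actual positives that the model predicted as negative."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    missed = sum(1 for _, p in positives if p == 0)
    return missed / len(positives)

def fnr_by_group(y_true, y_pred, groups):
    """Compute the false negative rate separately for each group label."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = false_negative_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return rates

# Hypothetical labels and predictions for two groups of five people each.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
groups = ["a"] * 5 + ["b"] * 5

rates = fnr_by_group(y_true, y_pred, groups)
gap = abs(rates["a"] - rates["b"])
print(rates, gap)
```

Here the model misses one in four positives in group "a" but three in four in group "b", a gap that an overall accuracy figure would hide. Computing such a metric requires the group attribute, which is often unavailable in federated deployments.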

7 Conclusion

Federated learning enables distributed client devices to collaboratively learn a shared prediction model while keeping all the training data on the device, decoupling the ability to do machine learning from the need to store the data in the cloud.

In recent years, federated learning has seen explosive growth in both industry and academia. Its influence is also gradually spreading to other fields: from machine learning to optimization, from statistics and information theory to cryptography, privacy, and fairness.

Data privacy is not binary. Different assumptions lead to different threat models, and each threat model has its own unique challenges.

The open problems discussed in the paper are not comprehensive; they reflect the authors' interests and backgrounds. The paper does not discuss the non-learning problems that arise in machine learning projects, even though these may also need to be solved on decentralized data. A basic example is computing descriptive statistics; a more complex one is computing the head of a histogram over an open set. Another important topic left out of the discussion is the legal and business issues that may motivate or constrain the use of federated learning.

Origin blog.csdn.net/weixin_42137700/article/details/104855428