Architecture Thoughts of Federated Learning

Table of contents

Introduction to Federated Learning (in detail)

The origin of federated learning

The history of federated learning

1) Machine Learning

2) Distributed machine learning

3) Privacy protection technology

4) Federated Learning

Norms and Standards for Federated Learning

Architecture Thoughts of Federated Learning

Community and Ecology of Federated Learning


Introduction to Federated Learning (in detail)

Federated learning is a distributed machine learning framework that incorporates privacy protection and secure encryption techniques. It aims to let decentralized participants collaborate on training a machine learning model without disclosing their private data to the other participants.

The training process of the classic federated learning framework can be briefly summarized in the following steps:

  • The coordinator establishes the basic model and informs the participants of its structure and parameters;
  • Each participant trains the model on its local data and returns the results to the coordinator;
  • The coordinator aggregates the participants' models into a more accurate global model, improving overall performance.
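To make these steps concrete, here is a minimal sketch of one coordination round in Python, using plain NumPy arrays to stand in for model parameters; the local_update function, the learning rate, and the toy linear-regression data are illustrative assumptions, not part of any particular framework:

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """Illustrative participant-side step: one gradient-descent update
    on a least-squares objective over the participant's private data."""
    X, y = local_data
    grad = X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * grad

def federated_round(global_weights, participants):
    """Coordinator-side step: send weights out, collect the locally
    trained models, and average them into a new global model."""
    local_models = [local_update(global_weights, data) for data in participants]
    return np.mean(local_models, axis=0)

# Toy setup: two participants hold private linear-regression data
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
participants = []
for _ in range(2):
    X = rng.normal(size=(50, 2))
    participants.append((X, X @ w_true + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, participants)
print(w)  # approaches w_true, yet neither dataset ever left its owner
```

Note that only model parameters cross the boundary between coordinator and participants; the raw data stays local, which is exactly the property the steps above are designed to preserve.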


The federated learning framework brings together several kinds of technology:

  1. Traditional machine learning model training techniques
  2. Algorithms for the coordinator to aggregate model parameters
  3. Communication technology for efficient transmission between the coordinator and participants
  4. Privacy-preserving encryption technology

In addition, the federated learning framework can include an incentive mechanism, so that data holders are willing to participate and the benefits are shared among them.

Google first applied federated learning in Gboard (the Google keyboard): user devices train local models on the user's own data, the model parameters are aggregated and redistributed during training, and the result is accurate prediction of the next word.

Besides scattered individual users, the participants in federated learning can also be multiple enterprises facing the dilemma of data islands: they hold independent databases but cannot share them with each other. Federated learning replaces raw remote data transfer with encrypted parameter exchange during training, which protects the security and privacy of each party's data while meeting the data security requirements of the laws and regulations already in force.

The origin of federated learning

Since artificial intelligence was formally proposed at the Dartmouth Conference in 1956, it has experienced three waves of development. The third wave, driven by deep learning, brought a leap forward, and artificial intelligence continues to develop, showing strong vitality across frontier fields.

However, the development of artificial intelligence at this stage is limited by data. Different institutions, organizations, and enterprises hold data of different magnitudes and heterogeneous formats that are difficult to integrate, forming data islands. Deep-learning-centered artificial intelligence, starved of data, cannot reach its full potential in smart retail, smart finance, smart healthcare, smart cities, smart industry, and other areas of production and life.

In the era of big data, the public has become more sensitive to data privacy. To strengthen data regulation and privacy protection and to establish the legal status of personal data as a new asset class, the European Union implemented the General Data Protection Regulation (GDPR) in 2018. China is also continuously improving relevant laws and regulations to govern the use of data: in 2017, the Cybersecurity Law of the People's Republic of China and the General Principles of the Civil Law of the People's Republic of China took effect, and the Opinions of the Central Committee of the Communist Party of China and the State Council on Building a More Complete System and Mechanism for the Market-oriented Allocation of Factors and the Personal Information Protection Law of the People's Republic of China (Draft) were issued. All of these legal provisions indicate that data owners must accept supervision, have an obligation to protect data, and must not disclose it.

At present, on the one hand, data islands and privacy concerns limit traditional artificial intelligence technology, and big data processing methods have hit a bottleneck; on the other hand, the heterogeneous data scattered across parties contains enormous potential application value that remains untapped.

Therefore, how to use heterogeneous data from multiple parties for further learning, and thereby promote the development and adoption of artificial intelligence while meeting data privacy, security, and regulatory requirements, has become an urgent problem. Federated learning, a technology that protects privacy and data security, emerged to meet this need.

The history of federated learning

Since artificial intelligence was formally proposed, it has evolved for more than 60 years and has become a widely applied, cutting-edge interdisciplinary field. Machine learning, one of the most important branches of artificial intelligence, has rich application scenarios and many practical applications.

With the advent of the big data era, demand for data analysis in all industries has grown dramatically. Big data, large models, and computationally complex algorithms place higher demands on machine performance. In this context, a single machine may be unable to train a large model over huge data, so distributed machine learning emerged. Distributed machine learning uses large-scale heterogeneous computing devices (such as GPUs) and multi-machine, multi-card clusters for training, with the goal of coordinating distributed machines to complete fast iterative model training.

However, traditional distributed machine learning first gathers the data under centralized management and then learns in parallel by partitioning the data or the model. This exposes the risk of data leakage by the data manager, which to some extent restricts the practical application and spread of distributed machine learning.

How to combine data privacy protection with distributed machine learning, so that models can be trained legally and compliantly while data security is guaranteed, is one of the hot research questions in artificial intelligence. Federated learning trains models jointly across parties without the data ever leaving its local premises; it both protects data security and privacy and realizes distributed training, making it a feasible way out of this dilemma.

Next, we outline the development of federated learning, which is still in its growth stage.

1) Machine Learning

The proposal and development of machine learning can be traced back to the 1940s. As early as 1943, Warren McCulloch and Walter Pitts described a computational model of neural networks in their paper "A logical calculus of the ideas immanent in nervous activity". The model draws on the workings of biological neurons and tries to simulate the brain's thought process, sparking many scholars' interest in neural network research.

In 1956, the Dartmouth Conference formally proposed the concept of artificial intelligence. Just three years later, Arthur Samuel defined machine learning: the study and construction of algorithms of a general kind (rather than task-specific ones) that let computers learn from data in order to make predictions.

However, because of flaws in the neural network designs of the time, the enormous computation they required, and the limits of hardware, neural networks were considered impossible to realize, and machine learning research stagnated for a long time.

It was not until the 1990s, as computing technology developed (later including cloud computing and heterogeneous computing), that many classic machine learning algorithms were proposed and achieved good results:

  • In 1990, Robert Schapire published "The strength of weak learnability", showing that a set of weak learners can be combined into a strong learner, which promoted the use of Boosting algorithms in machine learning.
  • In 1995, Corinna Cortes and Vladimir Vapnik published "Support-vector networks", proposing the support vector machine model.
  • In 2001, Leo Breiman published "Random forests", proposing the random forest algorithm. Later, with the introduction of deep network models and the backpropagation algorithm, neural networks returned to the research mainstream and entered a stage of vigorous development.

2) Distributed machine learning

To date, machine learning has developed many branches, and its scope of application keeps widening. However, as data volumes grow and models become more complex, a single node can no longer hold the required data and computation, and mainstream machine learning hit a bottleneck. Distributed machine learning was proposed to solve the problem of slow training on big data.

Distributed machine learning deploys massive data and computing workloads across multiple machines to improve system scalability and computing efficiency.

The core problems of going distributed are how to store data and how to process it in parallel. Mainstream distributed data processing descends from the ideas of distributed file storage and task decomposition that Google proposed and detailed in two papers: the Google File System (GFS) in 2003 and MapReduce in 2004.
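As a toy illustration of the MapReduce idea (a single-process sketch, not Google's actual implementation), a word count decomposes into a map phase that emits key-value pairs from independent input chunks and a reduce phase that groups and aggregates them:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Each mapper independently emits (word, 1) pairs for its chunk
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Pairs sharing a key are grouped and their values summed
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data needs big systems", "distributed systems scale"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)  # parallel in a real cluster
print(reduce_phase(mapped))  # {'big': 2, 'data': 1, ..., 'systems': 2, ...}
```

In a real cluster the map tasks run on different machines over GFS/HDFS blocks, and a shuffle step routes each key to the reducer responsible for it.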

Based on these core ideas, many enterprises and research institutions have developed big data computing and processing platforms and distributed machine learning platforms. Common big data computing and processing platforms include Hadoop, Spark, and Flink:

  • The Hadoop distributed system infrastructure was implemented by Apache in 2005; its HDFS distributed file system provides storage for massive data, and MapReduce provides the computation, effectively improving the speed of big data processing;
  • The Spark platform was developed by the AMP Lab at the University of California, Berkeley; it focuses on data-flow applications and extends the applicability of MapReduce;
  • Flink is a distributed processing framework that combines high throughput, low latency, and high performance, and it has been adopted by more and more Chinese companies in recent years.


Distributed machine learning training comes in two flavors: data parallelism and model parallelism. Data parallelism is the more common scheme: every device keeps its own full copy of the parameters, consumes different data, and synchronizes gradients via AllReduce during backpropagation. However, it is unsuitable when the model itself is too large to fit on a single device, which is what motivated model parallelism.
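The following is a conceptual, single-process simulation of data parallelism; real frameworks (for example PyTorch's DistributedDataParallel) perform the same gradient averaging with an AllReduce collective across devices, and the shard sizes and learning rate here are arbitrary:

```python
import numpy as np

def worker_gradient(weights, shard):
    """Every worker holds an identical copy of the weights
    but computes gradients on its own data shard."""
    X, y = shard
    return X.T @ (X @ weights - y) / len(y)

def all_reduce_mean(grads):
    """Stand-in for the AllReduce collective: afterwards, every
    worker holds the same averaged gradient."""
    return np.mean(grads, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, 2.0, 3.0])
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 workers, 4 disjoint shards

w = np.zeros(3)
for _ in range(200):
    grads = [worker_gradient(w, s) for s in shards]  # parallel in practice
    w -= 0.1 * all_reduce_mean(grads)                # identical update everywhere
print(w)  # ~ [1.0, 2.0, 3.0]
```

Because all workers apply the same averaged gradient, their parameter copies never diverge, which is what lets data parallelism scale out without changing the model.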

Model parallelism mainly includes intra-layer parallelism and inter-layer parallelism, both of which face problems of parameter synchronization and update. For this reason, the industry is exploring more efficient automatic parallelization methods and trying to reduce parameter communication through techniques such as gradient compression.

With the development of distributed technology, machine learning and deep learning frameworks have added distributed support:

  • At the end of 2013, the machine learning research group led by Professor Eric Xing at Carnegie Mellon University open-sourced the Petuum platform, aiming to improve the efficiency of parallel processing.
  • The mainstream deep learning frameworks TensorFlow and PyTorch began to support distributed execution and distributed training in 2016 and 2019, respectively.
  • In January 2017, MXNet, the open source framework officially backed by Amazon, entered the Apache Software Foundation. MXNet supports multiple languages and fast model training.
  • In March 2018, Baidu open-sourced PaddlePaddle, a cloud-based distributed deep learning platform.
  • In October 2018, Huawei launched ModelArts, a one-stop AI development platform that integrates the MoXing distributed training acceleration framework. MoXing is built on the open source deep learning engines TensorFlow, MXNet, PyTorch, and Keras, giving these engines higher distributed performance and better usability.
  • In January 2019, Intel open-sourced Nauta, a distributed deep learning platform that provides a multi-user distributed computing environment for deep learning training experiments.

3) Privacy protection technology

How to protect the privacy and security of data in data transmission has always been a research hotspot in the field of cryptography.

As early as 1982, Yao Qizhi (Andrew Yao) posed the "Millionaires' Problem": two millionaires want to know who is richer, but neither is willing to disclose their wealth to the other; the question is how to obtain the answer without revealing either figure. This problem opened up the research field of secure multi-party computation.

The protocols designed in this field solve the problem of how a group of mutually distrusting parties can compute jointly, without a trusted third party, while keeping their private information protected. Several secure multi-party computation frameworks now exist, and the cryptographic techniques involved include garbled circuits, secret sharing, homomorphic encryption, and oblivious transfer:

  • Garbled circuits target secure two-party computation. The idea is to convert the jointly computed function into a Boolean circuit and to encrypt and scramble each gate of the circuit, so that neither the original inputs nor intermediate results leak during computation; given the parties' respective inputs, the output of each logic gate is decrypted step by step until the final answer is obtained.
  • The idea of secret sharing is to split the secret to be protected, in some appropriate way, into pieces managed by different participants; only by working together can the participants recover the secret.
  • The idea of homomorphic encryption was proposed by Rivest in 1978; Gentry then constructed fully homomorphic encryption in his 2009 paper "Fully homomorphic encryption using ideal lattices". A fully homomorphic scheme is an encryption function that is homomorphic for both addition and multiplication and supports any number of such operations: decrypting the result of computing on homomorphically encrypted data yields the same output as running the same computation on the unencrypted data.
  • Oblivious transfer lets a receiver obtain one of several messages from a sender in such a way that the sender does not learn which message was chosen.
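To make the homomorphic encryption item above concrete, here is a small demonstration of additive homomorphism using the third-party python-paillier package (phe); this is one convenient open source implementation assumed for illustration, not the only option:

```python
from phe import paillier  # pip install phe (python-paillier)

public_key, private_key = paillier.generate_paillier_keypair()

# Two parties encrypt their private values under the same public key
a = public_key.encrypt(3.5)
b = public_key.encrypt(4.5)

# A third party can add the ciphertexts and scale them by plaintext
# constants without ever learning 3.5 or 4.5
total = a + b
doubled = total * 2

print(private_key.decrypt(total))    # 8.0
print(private_key.decrypt(doubled))  # 16.0
```

Paillier is only additively homomorphic; fully homomorphic schemes in the Gentry line also support multiplication between ciphertexts, at a much higher computational cost.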


Besides these encryption-centered technologies, privacy protection also includes perturbation methods, represented by differential privacy. Dwork surveyed its applications in the 2008 paper "Differential privacy: A survey of results", and differential privacy is now widely used in privacy protection. Its main idea is to add interference noise to the data to be protected, so that queries against two datasets differing in a single record return the same result with high probability, preventing the privacy leakage that repeated, differencing queries could otherwise cause.
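A minimal sketch of the Laplace mechanism, the classic way to realize this idea for numeric queries (the dataset, sensitivity, and epsilon below are illustrative):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Answer a query with Laplace noise of scale sensitivity/epsilon.
    Smaller epsilon means stronger privacy and noisier answers."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = np.array([34, 29, 41, 52, 38])
# A counting query has sensitivity 1: adding or removing one person's
# record changes the count by at most 1
noisy_count = laplace_mechanism(np.sum(ages > 30), sensitivity=1, epsilon=0.5)
print(noisy_count)  # close to 4, but randomized on every query
```

The noise is calibrated so that the presence or absence of any single record changes the output distribution only slightly, which is precisely the "same result with high probability" property described above.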

4) Federated Learning

With the development of big data and artificial intelligence, public debate about corporate AI algorithms violating personal privacy has never stopped. Organizations cannot guarantee every party's privacy when data is pooled, countries keep introducing privacy protection bills, and data islands have become a bottleneck for the development of artificial intelligence.

In this context, federated learning, a technology that can solve the data island problem while protecting data security and privacy, emerged.

In 2016, Google research scientist Brendan McMahan and colleagues proposed a federated learning training framework in the paper "Communication-efficient learning of deep networks from decentralized data". The framework uses a central server to coordinate multiple client devices in jointly training a model.
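The aggregation rule at the heart of that framework is federated averaging (FedAvg): each client's locally updated parameters are weighted by its share of the total training data,

$$ w_{t+1} \;=\; \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}, \qquad n = \sum_{k=1}^{K} n_k, $$

where $w_{t+1}^{k}$ denotes the parameters after local training on client $k$ in round $t$, and $n_k$ is the number of samples held by client $k$.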

In April 2017, Brendan McMahan and Daniel Ramage published the post "Federated Learning: Collaborative Machine Learning without Centralized Training Data" on the Google AI Blog, introducing the application of federated learning to keyboard prediction, implemented with a simplified on-device version of the TensorFlow framework.

Google's explorations have fired practitioners' enthusiasm, in China and abroad, for federated learning technology and frameworks. Many institutions have since developed model training frameworks and platforms based on federated learning ideas:

  • PyTorch, the deep learning framework developed by Facebook, began to adopt federated learning techniques to protect user privacy;
  • WeBank launched the Federated AI Technology Enabler (FATE) open source framework;
  • Ping An Technology, Baidu, Tongdun Technology, JD.com, Tencent, ByteDance, and many other companies have built intelligent platforms with federated learning technology, demonstrating its broad application prospects across fields and industries.


As a new paradigm of artificial intelligence, federated learning can resolve difficulties facing the development of big data. As the industry explores industrial-, commercial-, and enterprise-grade platforms built on federated learning, the market is flourishing. Meanwhile, the norms and standards of federated learning architecture keep improving, real commercial scenarios are multiplying, and the construction of the federated learning ecosystem is taking shape.

Norms and Standards for Federated Learning

At present, different institutions hold different views on the meaning, scope, and concrete technical solutions of federated learning, and no unified norms and standards have yet formed.

However, many institutions are actively participating in and guiding the formulation of domestic and foreign standards related to federated learning:

  • In December 2018, the IEEE Standards Association approved the standards project P3652.1 (Guide for Architectural Framework and Application of Federated Machine Learning), initiated by WeBank, covering federated learning architecture and application specifications.
  • In June 2019, the China Artificial Intelligence Open Source Software Development Alliance (AIOSS) released the group standard "Information Technology Service: Federated Learning Reference Architecture", led by WeBank.
  • In March 2020, Ant Group led the drafting of the "Shared Learning System Technical Requirements" alliance standard, which was approved by the China Artificial Intelligence Industry Development Alliance (AIIA).
  • In June 2020, the 139th e-meeting of 3GPP SA2 approved the "Federated Learning Between Multiple NWDAF Instances" proposal from China Mobile, introducing a federated learning architecture and procedures into the 3GPP standard.
  • In July 2020, the "Technical Requirements and Test Methods for Data Circulation Products Based on Federated Learning", jointly drafted by the China Academy of Information and Communications Technology and Baidu, was released for the first time; it is another group standard on federated learning.


In addition, various companies and institutions have released white papers on the theoretical principles and applicable scenarios of federated learning:

  • WeBank, together with China UnionPay, Ping An Technology, Pengcheng Lab, Tencent Research Institute, the China Academy of Information and Communications Technology, and China Merchants Financial Technology, released the "Federated Learning White Paper 2.0";
  • Tongdun Technology's Artificial Intelligence Research Institute released the "Knowledge Federation White Paper";
  • Tencent Security released the "Tencent Security Federated Learning Application Service White Paper";
  • IBM released the "IBM Federated Learning: An Enterprise Framework White Paper V0.1".

Architecture Thoughts of Federated Learning

Federated learning architectures come in two types: the centralized (client/server) architecture and the decentralized (peer-to-peer) architecture.

Scenarios that join many individual users generally adopt the client/server architecture, with the enterprise acting as the server that coordinates the global model. Scenarios that join multiple enterprises facing the data island dilemma generally adopt the peer-to-peer architecture, because it is hard to pick one of the companies to act as the coordinating server.

In the client/server architecture, each participant must cooperate with the central server to complete joint training, as shown in Figure 1.
 


Figure 1 Client/server architecture of federated learning system


When there are at least two participants, the federated learning process can start. Before training, the central server first distributes the initial model to each participant. Then, in each round:

  1. Each participant trains the received model on its local dataset;
  2. Each participant encrypts the model parameters obtained from local training and uploads them to the central server;
  3. The central server aggregates all of the uploaded model gradients, then encrypts the aggregated global model parameters and sends them back to all participants.
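One simple way the upload in step 2 can hide individual updates from the server is pairwise masking, shown below as a toy, single-process sketch; production secure aggregation protocols (for example the one by Bonawitz et al.) add key agreement, dropout handling, and authenticated channels on top of this idea:

```python
import numpy as np

rng = np.random.default_rng(42)
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
n = len(updates)

# Each pair (i, j), i < j, agrees on a random mask; client i adds it,
# client j subtracts it, so every mask cancels in the server's sum.
masks = {(i, j): rng.normal(size=2) for i in range(n) for j in range(i + 1, n)}

masked = []
for i, u in enumerate(updates):
    m = u.copy()
    for j in range(n):
        if i < j:
            m += masks[(i, j)]
        elif j < i:
            m -= masks[(j, i)]
    masked.append(m)

# The server sees only masked vectors, yet their sum is the true sum
print(np.sum(masked, axis=0))   # [ 9. 12.]
print(np.sum(updates, axis=0))  # [ 9. 12.]
```

The server thus learns the aggregate it needs to update the global model, while each individual participant's update remains hidden.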



In the peer-to-peer computing architecture, there is no central server; all interactions happen directly between the participants, as shown in Figure 2.
 


Figure 2 Peer-to-peer system architecture for federated learning


After training the initial model locally, each participant must encrypt its local model parameters and transmit them to the other data holders taking part in the joint training. Assuming there are n participants, each participant therefore needs to transmit at least 2(n-1) encrypted model parameter messages per round.

Because no third-party server is involved and the participants interact directly, the peer-to-peer architecture requires more encryption and decryption operations. Throughout the process, every model parameter exchange is encrypted.

At present, this can be realized with technologies such as secure multi-party computation and homomorphic encryption. The global model parameters can be updated with aggregation algorithms such as federated averaging, and when participants' data needs to be aligned, schemes such as sample alignment can be used.
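As a complement, here is a toy sketch of fully decentralized (gossip-style) averaging, one simple way peers can reach a common model without any server; the three peers, the neighbor topology, and the scalar "models" are illustrative assumptions:

```python
import numpy as np

# Each peer starts from its locally trained parameters
peers = [np.array([1.0]), np.array([5.0]), np.array([9.0])]
# With 3 peers everyone is everyone's neighbor; larger networks
# typically use sparser topologies and more gossip rounds
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

for _ in range(5):  # a few gossip rounds
    new_peers = []
    for i, w in enumerate(peers):
        # Average own parameters with those received from neighbors;
        # in a real deployment these exchanges would be encrypted
        received = [peers[j] for j in neighbors[i]]
        new_peers.append(np.mean([w] + received, axis=0))
    peers = new_peers

print(peers)  # every peer converges to [5.0], the global average
```

Each peer ends up with the same consensus parameters that a central server would have computed, at the cost of the extra peer-to-peer transmissions and encryption operations described above.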

Community and Ecology of Federated Learning

Like other IT technologies, federated learning has companies and institutions building technical communities and an open source ecosystem around it, bringing together practitioners in the industry to learn and communicate, follow the latest developments, and share and discuss cutting-edge techniques.

Since the concept was proposed, federated learning has developed in the open. Google shares its progress and ideas on the Google AI Blog, an accurate and reliable source for machine learning research even though blog posts tend to be informal and conversational. The blog has a section dedicated to machine learning and federated learning research that has supplied many new ideas for the field. The Facebook and NVIDIA blogs also carry articles on federated learning technology.
 



Similar to the Google AI Blog, Tencent's Tencent Cloud Community has columns on big data and artificial intelligence, sharing ideas about Tencent's federated learning platform Angel PowerFL. The platform emphasizes ease of use, efficiency, and scalability:

  • It uses Apache Spark as the computing engine inside each participant, which makes it easier to interface with other task flows;
  • It uses Apache Pulsar as the message queue for transmission across the public network, supporting heavy network traffic with good scalability;
  • It implements an efficient Paillier ciphertext computation library in C to improve and optimize performance.

In early 2020, a ByteDance team open-sourced Fedlearner, a federated learning platform, on GitHub. Its model training centers on neural network models and tree models. For neural network training, adding send and receive operators to the original TensorFlow model code is enough to turn it into a model that supports federated training.
 

Ant Group's technical community, for example, shares news and cutting-edge technology across the financial technology field, and also organizes online live streams, offline sharing sessions, and other activities.

Baidu's AI developer community is similar, divided into sections covering many fields and technical categories, such as image recognition, knowledge graphs, and augmented reality. Baidu's open source framework PaddlePaddle includes modules that support the federated learning paradigm.

Origin blog.csdn.net/qq_38998213/article/details/131434304