In-depth Understanding of Federated Learning: The Emergence of the Concept of Federated Learning



Since the Dartmouth Conference in 1956, artificial intelligence has gone through two ups and downs and is now in its third peak period. The first peak arose because people saw the promise of AI, namely the hope that automated algorithms would improve efficiency; but algorithmic capabilities were limited, machines could not handle large-scale data training or complex tasks, and AI entered its first trough. The second peak came with the proposal of the Hopfield neural network, and the backpropagation (BP) algorithm achieved a breakthrough in neural network training, making large-scale neural network training possible. However, computing power and data proved insufficient, and the design of expert systems could not keep up with the growing needs of industry, which triggered AI's second trough. In 2006, deep learning was proposed, and together with the enormous improvements in algorithms and computing power, and the emergence of big data in recent years, artificial intelligence ushered in its third peak. In 2016, AlphaGo, trained on a total of 300,000 games of Go, defeated top human professional Go players one after another. People have truly seen the great potential of artificial intelligence, and hope all the more that AI technology can show its strength in more complex and cutting-edge fields such as autonomous driving, medical care, and finance.

The great success of AlphaGo made people naturally hope that this kind of big-data-driven artificial intelligence could be realized in every walk of life. The real situation, however, is disappointing: outside a limited number of industries, most fields suffer from limited or poor-quality data, not enough to support the deployment of artificial intelligence technology. This misconception that "artificial intelligence is available everywhere" can lead to serious business consequences. A case in point is IBM's Watson, a famous question answering (QA) system: given a question Q, it can find the answer A very accurately. Watson expresses the question Q as a high-dimensional representation, which can be likened to a spectrum in physics: a prism decomposes a beam of light into light of different frequencies, forming a spectrum. With this "spectrum", the question can be matched against an answer library, and the answers with correspondingly high scores are the likely ones. The whole process sounds simple, but it requires a very robust answer library. After IBM's success in the TV quiz competition, it applied the system to a vertical domain that sounded promising: medicine. However, a US cancer treatment center later found the application far from ideal, and the project failed. Consider where the questions and answers in the medical field come from. The inputs include symptoms, gene sequences, pathology reports, various test results, and research papers, and Watson's task is to use these data to make diagnoses and assist doctors. After a period of practice, it turned out that these data sources were far from sufficient, so the system performed poorly. The medical field requires a large amount of labeled data, but doctors' time is precious, and unlike some computer-vision applications, the labeling cannot be done by ordinary annotators. Labeled data in a professional field such as medicine is therefore very limited. Some estimate that if medical data were labeled by a third-party company, collecting enough effective data would take 10,000 people as long as 10 years. In such fields, even with many people labeling, the data is still not enough. This is the reality we face.
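
To make the retrieval idea above concrete, here is a minimal Python sketch of the general embed-and-match pattern: a question is turned into a high-dimensional vector (the "spectrum") and scored against a bank of candidate answers. This is only an illustration under simplifying assumptions, not Watson's actual pipeline; the toy `embed` function and the sample answer bank are hypothetical stand-ins for a learned encoder and a real answer library.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash characters into a fixed-size unit vector.
    A real system would use a learned encoder instead of this toy."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text):
        vec[(i + ord(ch)) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def answer(question: str, answer_bank: list[str]) -> str:
    """Return the candidate whose embedding scores highest against the question."""
    q = embed(question)
    # Cosine similarity, since both vectors are normalized.
    scores = [float(q @ embed(a)) for a in answer_bank]
    return answer_bank[int(np.argmax(scores))]

bank = ["Paris is the capital of France.",
        "Water boils at 100 degrees Celsius."]
print(answer("What is the capital of France?", bank))
```

As the paragraph above notes, the matching step itself is simple; the hard part in practice is building a robust, well-covered answer bank, which is exactly what the medical domain lacked.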

At the same time, barriers between data sources are difficult to break down. In general, the data required by artificial intelligence spans multiple domains. For example, in an AI-based product recommendation service, the product seller owns data about its products and users' purchase records, but has no data on users' purchasing power or payment habits. In most industries, data exists in the form of isolated islands. Because of industry competition, privacy and security concerns, and complicated administrative procedures, even data integration between different departments of the same company faces many obstacles. In practice, in China it is almost impossible to integrate the data scattered across various places and institutions, or the cost required is enormous.

On the other hand, with the further development of big data, attaching importance to data privacy and security has become a worldwide trend. Every public data leak attracts great attention from the media and the public; the Facebook data leak, for example, triggered large-scale protests. Meanwhile, countries are strengthening the protection of data security and privacy. The General Data Protection Regulation (GDPR), officially enforced by the European Union in 2018, shows that ever-stricter management of user data privacy and security will be a global trend. This has brought unprecedented challenges to the field of artificial intelligence. In both research and business today, the party that collects data is usually not the party that uses it: party A collects data and transfers it to party B for cleaning; party B transfers it to party C for modeling; and the model is finally sold to party D for use. This form of transferring, exchanging, and trading data between entities violates the GDPR and can be severely punished under it. Similarly, the Cybersecurity Law of the People's Republic of China and the General Provisions of the Civil Law of the People's Republic of China, both implemented in 2017, stipulate that network operators must not disclose, tamper with, or destroy the personal information they collect, and that when trading data with third parties, the contract must clearly specify the scope of the data being traded and the data-protection obligations. These regulations, to varying degrees, pose new challenges to the traditional data-processing mode of artificial intelligence, and at present the AI academic and business communities have no good solutions to these challenges.

Traditional approaches have hit bottlenecks in resolving this big-data dilemma. Simply exchanging data between two companies is not allowed under many regulations, including the GDPR: first, users are the owners of the raw data, and companies cannot exchange data without users' approval; second, the purpose of data modeling cannot be changed before users approve it. Therefore, many past attempts at data exchange, such as trading through data exchanges, would require huge changes to become compliant. At the same time, the data held by commercial companies often has enormous potential value, so two companies, and even two departments within one company, must weigh the exchange of interests. Under this premise, such departments will often not simply aggregate their data with others'. As a result, data usually ends up in silos, even within the same company.

How to design a machine learning framework that lets artificial intelligence systems use data more efficiently and accurately while meeting the requirements of data privacy, security, and regulation is an important topic in the current development of artificial intelligence. Federated learning was proposed to solve this problem of data islands; it offers a feasible solution that satisfies privacy protection and data security (a minimal training sketch follows the list below):

  • The data of all parties stays local, so no privacy is leaked and no regulation is violated.
  • Multiple participants jointly build a virtual, shared model from their combined data and benefit from it together.
  • Under a federated learning system, every participant has equal identity and status.
  • The modeling performance of federated learning is the same as, or not much worse than, training on the pooled data in one place (provided the parties' data are aligned by user or by feature).
  • When users or features are not aligned, federated transfer learning can still achieve knowledge transfer between the parties' data by exchanging encrypted parameters.
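
The sketch below illustrates the kind of training loop this describes, in the style of federated averaging (FedAvg): each party runs a few local training steps on its own private data, and only model parameters, never the raw data, are sent for aggregation. It is a minimal illustration under simplifying assumptions (a plain linear model and plaintext parameter averaging; a real deployment would add encryption or secure aggregation of the updates, as the list above implies). All function names and the synthetic data are hypothetical.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party's local training: a few gradient steps of linear regression."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w  # only the updated weights leave the party, never X or y

def federated_round(weights, parties):
    """Server step: average locally trained weights, weighted by data size."""
    updates = [local_update(weights, X, y) for X, y in parties]
    sizes = np.array([len(y) for _, y in parties], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three parties, each holding its own private (X, y) data that stays local.
parties = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, parties)
print("learned weights:", w)  # approaches [2.0, -1.0]
```

Note how the raw arrays never leave `local_update`; only weight vectors are averaged. This is what allows the jointly trained "virtual" model to approach the quality of training on the pooled data while each party's data stays local.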

Federated learning thus enables multiple parties to carry out machine learning while protecting data privacy and meeting legal and compliance requirements, solving the problem of data islands.

