Data leasing - a new way of data circulation

Ruan Wenqiang1,2, Xu Mingxin1,2, Tu Xinyu1,2, Song Lushan1,2, Han Weili1,2

1 Data Analysis and Security Laboratory, Fudan University, Shanghai 200438

2 Shanghai Key Laboratory of Data Science, Shanghai 200438

Abstract: Data is becoming a new factor of production that drives social development. Circulating data among multiple parties in a compliant and auditable manner is critical to realizing its value. From the perspective of privacy protection and data utilization, this paper proposes a new method of data circulation: data leasing. It first introduces the motivation for data leasing, then clarifies five requirements that data leasing should meet, and finally proposes a data leasing technology based on secret sharing.

Keywords: data circulation; secret sharing; data leasing; privacy protection

Paper citation format:

Ruan Wenqiang, Xu Mingxin, Tu Xinyu, et al. Data Leasing: A New Way of Data Circulation[J]. Big Data, 2022, 8(5): 3-11.

RUAN W Q, XU M X, TU X Y, et al. Data tenancy: a new paradigm for data circulation[J]. Big Data Research, 2022, 8(5): 3-11.

0 Preface

Data now stands alongside traditional factors of production such as capital, land, labor, and technology as a new factor of production. Data circulation plays an extremely important role in the formation of data value. Current methods of data circulation mainly include data disclosure by government departments or enterprises and data transactions. However, with the promulgation of the "Network Security Law of the People's Republic of China" (hereinafter the "Network Security Law"), the "Data Security Law of the People's Republic of China" (hereinafter the "Data Security Law"), and the "Personal Information Protection Law of the People's Republic of China" (hereinafter the "Personal Information Protection Law"), data related to user privacy can hardly be circulated directly among institutions. In addition, because of commercial competition, many institutions may be unwilling to transmit raw data directly to other institutions. The scenario that has received the most attention so far is how multiple institutions can jointly use data in a privacy-protected manner, where each institution contributes data and obtains the results of the analysis. By contrast, how an institution can mine the value contained in other institutions' data by "leasing" it still lacks corresponding research. Therefore, to promote the full formation of data value, this paper proposes a new method of data circulation: data leasing.

Data leasing enables the data lessee (the party leasing the data) to use the data of the data lessor to complete pre-agreed computing tasks (such as machine learning model training) and obtain the results in a paid, privacy-protected, and auditable manner, so that the data generates value. In light of the laws and regulations on privacy protection, this paper discusses the motivation and definition of data leasing and clarifies five requirements that it must meet. It then proposes a data leasing technology based on secret sharing, so that data scattered across institutions can circulate better through "leasing", thereby promoting the formation of data value.

1 Relevant knowledge and existing research

1.1 Secure multi-party learning technology based on secret sharing

Secure multi-party learning is a privacy-preserving machine learning technology built on secure multi-party computation. Secure multi-party learning based on secret sharing enables multiple participants to jointly train a pre-agreed machine learning model (with the training process represented as a Boolean or arithmetic circuit) and guarantees that no private information other than the resulting model is disclosed. As shown in Figure 1, D_1, D_2, and D_n denote the private data sets of participant 1, participant 2, and participant n, respectively. In an n-party secure multi-party learning process based on secret sharing, participant i first splits its private data set D_i into n secret shares and distributes them to the other participants. In some scenarios, some participants may only receive secret shares from others without sending any. After the secret shares of the data sets are distributed, all participants use a secure multi-party computation protocol to jointly generate randomized initial model parameters. They then run a secret-sharing-based secure multi-party computation that, through local computation and interactive communication, trains the target model on the secret shares of the data, so that each participant ends up with a secret share of the target model. Depending on the scenario, the participants can then either keep the target model in secret-shared form and perform inference interactively, or recover it in plaintext by exchanging their shares. Two secret sharing techniques are currently popular in secure multi-party learning: additive secret sharing and Shamir secret sharing. Additive secret sharing supports two or more participants, whereas Shamir secret sharing supports three or more.
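To make the split-and-recover step concrete, the following is a minimal Python sketch of additive secret sharing. It is an illustration rather than the implementation used in the paper: the ring size `MODULUS` and the function names `share`, `reconstruct`, and `add_shares` are assumptions introduced here, and a real system would encode fixed-point training data rather than small integers.

```python
import secrets
from typing import List

MODULUS = 2 ** 64  # assumed ring size; the paper does not fix one


def share(value: int, n_parties: int) -> List[int]:
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    if n_parties < 2:
        raise ValueError("additive secret sharing needs at least two parties")
    # The first n-1 shares are uniformly random; the last one fixes the sum.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares: List[int]) -> int:
    """Recover the secret by summing all shares modulo MODULUS."""
    return sum(shares) % MODULUS


def add_shares(a: List[int], b: List[int]) -> List[int]:
    """Each party adds its own two shares locally; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


x_shares = share(42, 3)
y_shares = share(100, 3)
print(reconstruct(add_shares(x_shares, y_shares)))  # 142
```

Recovering the secret requires all of the shares; any proper subset of them is uniformly distributed and reveals nothing about the secret.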

Figure 1 An example of a secure multi-party learning process based on secret sharing
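Shamir secret sharing, the second technique named above, can be sketched as follows. This is a minimal illustration under stated assumptions: the prime `PRIME`, the `threshold` parameter, and the function names are chosen here for the example and do not come from the paper.

```python
import secrets
from typing import List, Tuple

PRIME = 2 ** 61 - 1  # a Mersenne prime, chosen here only for illustration


def shamir_share(secret: int, threshold: int, n_parties: int) -> List[Tuple[int, int]]:
    """Evaluate a random polynomial of degree threshold-1 with f(0) = secret at x = 1..n."""
    if not 1 <= threshold <= n_parties:
        raise ValueError("threshold must be between 1 and n_parties")
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def f(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, f(x)) for x in range(1, n_parties + 1)]


def shamir_reconstruct(points: List[Tuple[int, int]]) -> int:
    """Recover f(0) by Lagrange interpolation over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


points = shamir_share(2022, threshold=2, n_parties=3)
print(shamir_reconstruct(points[:2]))  # any 2 of the 3 shares recover 2022
```

Unlike additive sharing, any `threshold` of the n shares are enough to recover the secret, while fewer than `threshold` reveal nothing about it.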

Secure multi-party learning based on secret sharing has the following four characteristics: ① all participants obtain only the resulting model and learn nothing about the inputs of other participants; ② all participants jointly train a pre-agreed target model whose training process can be represented as a circuit (arithmetic or Boolean); ③ all participants must take part in the training process; ④ the resulting model can be held by all participants or by only one or some of them. In the latter case, all participants send their secret shares of the resulting model to the parties entitled to recover it, and those parties reconstruct the final model once they have received the other parties' shares.

1.2 Security Model

The data leasing technology proposed in this paper adopts the semi-honest security model: each participant performs the computation according to the steps specified by the protocol and sends the prescribed messages to the other participants, but tries to infer the other participants' inputs from the messages it receives. Since participants currently use secure multi-party learning mainly to satisfy the requirements that privacy protection laws and regulations place on data circulation, the semi-honest model is a suitable security model for practical scenarios, on the premise that all participants are willing to share data.
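Under the semi-honest model, a participant may inspect every message it receives. The sketch below assumes two-party additive secret sharing over a ring of size 2^64 (both are assumptions; the paper fixes neither) and illustrates why a single share gives such a participant nothing to infer from: for any candidate secret there is a counterpart share that makes the observed share consistent with it. The function name `consistent_counterpart` is hypothetical.

```python
import secrets

MODULUS = 2 ** 64  # assumed ring size


def consistent_counterpart(observed_share: int, candidate_secret: int) -> int:
    """Return the unique other share under which `observed_share` encodes `candidate_secret`."""
    return (candidate_secret - observed_share) % MODULUS


# A semi-honest party sees one uniformly random share of an unknown secret.
observed = secrets.randbelow(MODULUS)

# Every candidate secret is equally plausible: each has exactly one matching counterpart.
for candidate in (0, 1, 123456789):
    other = consistent_counterpart(observed, candidate)
    print(candidate, (observed + other) % MODULUS == candidate)
```

Because the counterpart share is uniformly random in a real execution, the observed share alone is statistically independent of the secret.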

1.3 Related research work

As countries around the world have promulgated laws and regulations on personal information protection (for example, the European Union's "General Data Protection Regulation", which took effect in 2018, and China's "Personal Information Protection Law", issued in 2021), the circulation of data involving user privacy has been severely restricted. In recent years, to fully tap the value hidden in data from different organizations while remaining compliant, researchers have proposed and implemented many privacy computing algorithms and systems. These enable multiple data holders to carry out joint modeling and analysis on all parties' data in a privacy-protected manner, achieving the goal of "data is available but not visible". Privacy computing technologies that currently receive the most attention include secure multi-party learning and federated learning.

In 2017, Mohassel et al. proposed SecureML, the first secure multi-party learning system that supports neural network model training. Researchers subsequently proposed and implemented many secure multi-party learning systems, including ABY3 and Fantastic Four, which support more participants and are more efficient; SWIFT and BLAZE, which support the malicious adversary model; and CryptGPU and Falcon, which support training and inference for complex models. In these existing systems, all participants have equal status: each must provide data, and each can obtain the result once the computation is complete. Frameworks and mechanisms that let one institution analyze other institutions' data through a privacy-preserving, auditable "lease" still require further research.

In addition, Google proposed the concept of federated learning in 2015. Subsequently, many companies launched joint modeling systems based on federated learning, such as TensorFlow Federated, released by Google, and FATE (Federated AI Technology Enabler), launched by WeBank. Compared with secure multi-party learning systems, systems based on federated learning are more efficient but carry higher privacy risks. For example, the intermediate results transmitted between participants are likely to leak private information about the input data. Moreover, there is currently no mathematical model for quantitatively analyzing the privacy risks of federated learning systems. Joint modeling with federated learning may also reduce the accuracy of the resulting model; the loss is especially large when the parties' data are not independent and identically distributed.

2 Overview of data leasing

2.1 Motivation for Data Leasing

The main method of data circulation today is data transactions between institutions: a data buyer pays a data seller a fee, directly obtains the data, and can then perform arbitrary analysis on it. Many data trading platforms have emerged in China. Although data transactions play an important role in promoting data circulation, two limitations still prevent data from circulating fully in some scenarios, as follows.

● The data to be circulated may contain users' private information. With the successive promulgation of the "Network Security Law", the "Data Security Law", and the "Personal Information Protection Law", directly transferring or transmitting such data may expose the institutions that sell it to serious legal risk.

● Because of commercial competition and other considerations, organizations or individuals holding data may not wish to send it directly to other organizations, but may allow other organizations to perform specific, less sensitive computing operations on all of their data.

When data is sensitive and cannot be circulated directly between institutions, data leasing enables the data lessee to use the data lessor's data to complete specific computing tasks in a privacy-preserving and auditable way, thereby promoting the full formation of data value.

2.2 Definition of Data Leasing

Referring to the traditional definition of asset leasing, and considering the unique form of data assets and the privacy protection laws that have been issued, this paper defines data leasing as follows: within an agreed period, the data lessor uses the data assets it holds to complete specific computing tasks required by the data lessee. In the end, the data lessee obtains only the computation results, and the data lessor obtains the rent.

Because data can be copied at almost zero cost and data involving users' private information is protected by law, a data lessor cannot, as in traditional asset leasing, hand the data assets over to the data lessee for a period of time. Instead, the data lessor can obtain the benefits of leasing its data only by completing the computing tasks specified by the data lessee.

In addition, data sharing is defined as "allowing users who use different computers and different software in different places to read other people's data and perform various operations, calculations, and analysis". Compared with data sharing, data leasing differs in three ways: ① the data lessee cannot directly read the data lessor's data and obtains only the output of the computing task; ② the data lessor can price the rent according to the data lessee's computing task; ③ both the data lessor and the data lessee need to supervise the computation to ensure that the data leasing transaction follows the pre-agreed process. In summary, data leasing imposes more requirements than data sharing, and these requirements pose more and greater technical challenges for its implementation.

2.3 Characteristics of data leasing

According to the definition of data leasing, a data leasing framework should meet the following five requirements.

● Priceable: The lease fee that the data lessee should pay to the data lessor can be calculated from the complexity of the target computing task and the number of times the data is used.

● Privacy: The data lessor does not transmit plaintext data directly to other organizations. To avoid potential legal risks, the data lessor's data should be kept locally to prevent the leakage of users' private information.

● Validity: The data lessee can work with the data lessor, using the data lessor's data, to complete the computing tasks agreed by both parties in advance and obtain the results. The data lessee's own data may also take part in the computation. Note that multiple data lessees may lease data from the same institution at the same time to complete their respective target computing tasks.

● Supervisable computation: Both the data lessor and the data lessee should be able to supervise the computation, that is, each should be able to ensure that the other performs only the pre-agreed computing operations on the data. With a supervisable computation, the data lessor can charge rent according to the type and number of computing operations, and the data lessee can be sure that it is able to use other institutions' data to complete its specific computing tasks.

● Auditable: The computing operations that the data lessor and the data lessee perform on the data should be auditable by a third party. This avoids disputes over payment of the rent when, after the computing task is completed, the two parties cannot agree on the type and number of operations performed.

3 Design of data leasing technology based on secret sharing

Although other privacy computing technologies (such as federated learning) can achieve a certain degree of privacy protection, they lack theoretical guarantees for the protection they provide. Secure multi-party learning, by contrast, performs its underlying computation with secure multi-party computation, which gives the computation strict security guarantees. Therefore, this paper proposes a data leasing technology based on secret sharing, in which the data lessor and the data lessee take part in a secret-sharing-based secure multi-party learning process to complete the computing tasks they agreed on in advance. The following subsections describe the roles and the computation process in detail and explain how the technology meets four requirements: privacy, validity, supervisability of the computation, and auditability. The priceable requirement is decoupled from the subsequent computation, and there is already much research on data pricing, such as methods based on game theory, so this paper does not discuss how to meet it. Compared with existing secure outsourced computation methods based on homomorphic encryption, the proposed technology lets the data lessor and the data lessee supervise each other's computing operations by taking part in the computation. In addition, by introducing blockchain technology, it lets a third party audit the transaction information after the transaction is completed, which prevents either the data lessor or the data lessee from repudiating the transaction.

3.1 Role Definition

The data leasing technology based on secret sharing proposed in this paper (shown in Figure 2) involves three types of roles: the data lessee, the data lessor, and the leasing platform, as follows.

Figure 2 Three types of roles in the data leasing technology based on secret sharing

● Data lessee. The data lessee may hold some data of its own and wishes to lease the data lessor's data for a fee, so as to obtain more useful information through joint mining of multi-party data. The data lessee must describe its target computing task to the data lessor and the leasing platform, and completes the task through secret-sharing-based secure multi-party learning.

● Data lessor. The data lessor leases to the data lessee the data that the lessee needs, and charges a fee according to the complexity of the computing tasks the data lessee completes with its data and the number of times the data is used. Multiple data lessors may take part in one data lease. By taking part in a secret-sharing-based secure multi-party learning process with the data lessee, the data lessor completes the data lessee's target computing tasks and supervises the computing operations that the data lessee performs on its data.

● Leasing platform. The leasing platform provides an information platform for data leasing and audits data leasing transactions. It receives and publishes data information from data lessors and responds to data lessees' queries about that information, facilitating data leasing transactions.

3.2 Learning process

After the data lessor and the data lessee agree on the type and quantity of leased data, the target computing task, and the lease fee, they jointly take part in a secret-sharing-based secure multi-party learning process to complete the data leasing transaction. The process is shown in Figure 3. Each party first uses secret sharing to generate secret shares of the data it holds and distributes them to the other participants as input. All parties then complete the target computing task through secret-sharing-based secure multi-party learning, and the result is finally returned to the data lessee.

Figure 3 Calculation process of data leasing technology based on secret sharing

Specifically, the data lessee first converts its target computing task into a circuit representation (a Boolean circuit composed of AND, OR, and NOT gates, or an arithmetic circuit composed of multiplication and addition gates) and sends the circuit to the other parties as input to the subsequent computation. The data lessee also computes a digest of the target circuit and uploads it to the blockchain, so that a third party can audit the transaction based on the on-chain data after the data leasing transaction is completed. If the data lessee's own data takes part in the computing task, the data lessee uses secret sharing to generate secret shares of its data and distributes the corresponding shares to the other participants. The data lessor likewise generates secret shares of its data and distributes the corresponding shares to the other participants as input to the subsequent computation, which completes the "lease" of the data. Once the data lessor and the data lessee hold the secret shares of the input data and the circuit representation of the computing task, they evaluate the target circuit with secret-sharing-based secure multi-party learning, through local computation and communication, taking the secret shares held by each party as input. To evaluate the target circuit, all parties first split it into multiple circuit layers according to the dependencies between gates: the input of each layer comes from the previous layer, and its output is passed to the next layer. The target circuit is then evaluated layer by layer, with the gates in each layer computed in turn, and the output of the last layer is the secret share of the computation result. NOT gates and addition gates can be computed locally, whereas AND gates, OR gates, and multiplication gates require interaction among all parties. Finally, the data lessors send their secret shares of the result to the data lessee, which uses the received shares to recover the result and pays the corresponding rent to the data lessors, completing the data leasing transaction.
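The layer-by-layer evaluation described above can be sketched in Python for a small arithmetic circuit. This is a single-process simulation under several assumptions that are not taken from the paper: additive sharing over a ring of size 2^64, a gate format `(output, op, left, right)`, and multiplication gates realized with Beaver triples handed out by a trusted dealer. For simplicity, a gate here may read any wire produced in an earlier layer, not only the immediately preceding one.

```python
import secrets
from typing import Dict, List, Tuple

MODULUS = 2 ** 64  # assumed ring size
Gate = Tuple[str, str, str, str]  # (output wire, "add" | "mul", left input, right input)


def share(value: int, n: int) -> List[int]:
    parts = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    return parts + [(value - sum(parts)) % MODULUS]


def reveal(shares: List[int]) -> int:
    return sum(shares) % MODULUS


def split_into_layers(gates: List[Gate], inputs: List[str]) -> List[List[Gate]]:
    """Group gates so that every gate only reads wires from earlier layers."""
    depth = {wire: 0 for wire in inputs}
    layers: Dict[int, List[Gate]] = {}
    for out, op, left, right in gates:  # gates are assumed to be topologically ordered
        depth[out] = max(depth[left], depth[right]) + 1
        layers.setdefault(depth[out], []).append((out, op, left, right))
    return [layers[d] for d in sorted(layers)]


def multiply(x: List[int], y: List[int]) -> List[int]:
    """Interactive multiplication gate using a Beaver triple from an assumed trusted dealer."""
    n = len(x)
    a, b = secrets.randbelow(MODULUS), secrets.randbelow(MODULUS)
    a_sh, b_sh, c_sh = share(a, n), share(b, n), share(a * b % MODULUS, n)
    # Communication round: the parties open the masked values d = x - a and e = y - b.
    d = reveal([(xi - ai) % MODULUS for xi, ai in zip(x, a_sh)])
    e = reveal([(yi - bi) % MODULUS for yi, bi in zip(y, b_sh)])
    z = [(ci + d * bi + e * ai) % MODULUS for ai, bi, ci in zip(a_sh, b_sh, c_sh)]
    z[0] = (z[0] + d * e) % MODULUS  # the public term d*e is added by one party only
    return z


def evaluate(layers: List[List[Gate]], wires: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Evaluate the circuit layer by layer on secret-shared wire values."""
    for layer in layers:
        for out, op, left, right in layer:
            if op == "add":  # local: each party adds its own shares
                wires[out] = [(l + r) % MODULUS for l, r in zip(wires[left], wires[right])]
            elif op == "mul":  # interactive
                wires[out] = multiply(wires[left], wires[right])
            else:
                raise ValueError(f"unsupported gate type: {op}")
    return wires


# Hypothetical task: the lessee contributes x, the lessor contributes y and w.
gates = [("s", "add", "x", "y"), ("out", "mul", "s", "w")]
layers = split_into_layers(gates, ["x", "y", "w"])
wires = {"x": share(3, 2), "y": share(4, 2), "w": share(5, 2)}
result_shares = evaluate(layers, wires)["out"]
print(reveal(result_shares))  # the lessee recovers (3 + 4) * 5 = 35
```

In a deployment each list position would live on a different machine and `reveal` inside `multiply` would be a network round; the final `reveal` corresponds to the data lessors sending their result shares to the data lessee.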

3.3 Analysis

Next, the computation process is analyzed to show that it meets the four requirements of privacy, validity, supervisability of the computation, and auditability.

● Privacy. Both the data lessee and the data lessor use secret sharing to generate secret shares of their data and distribute the shares to the other participants, and all subsequent computation is carried out with secret-sharing-based secure multi-party learning. By the properties of this technology, no participant can obtain information about other participants' data during the computation, which ensures the privacy of the data lessor's data.

● Validity. Secret-sharing-based secure multi-party learning supports joint computation by multiple participants, so the data lessee and the data lessor can jointly complete the pre-agreed computing tasks on the input data of multiple parties. In the end, the data lessee obtains the computation result, which ensures the validity of the data leasing transaction.

● Supervisable computation. Secret-sharing-based secure multi-party learning requires all participants to know the circuit corresponding to the computing task and to take part in the computation. Therefore, every computation in the above process requires the joint participation of the data lessor and the data lessee, so that each can supervise the computing operations performed by the other.

● Auditable. As shown in Figure 3, before the computation starts, the data lessee uploads the digest of the target circuit to the blockchain. After the computation is completed, a third party (such as the leasing platform) can audit the completed data leasing transaction by checking the digest on the blockchain.
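The digest-based audit can be sketched as follows. The paper does not specify the hash function, the circuit encoding, or the blockchain interface, so everything concrete here is an assumption: SHA-256 over a canonical JSON encoding of the gate list, an in-memory list standing in for the blockchain, and the hypothetical names `circuit_digest`, `record_lease`, and `audit`.

```python
import hashlib
import json
from typing import Dict, List


def circuit_digest(gates: List[List[str]]) -> str:
    """SHA-256 over a compact JSON encoding of the gate list (assumed encoding)."""
    encoded = json.dumps(gates, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def record_lease(ledger: List[Dict[str, str]], lease_id: str, gates: List[List[str]]) -> None:
    """Lessee side: append the circuit digest to the ledger before the computation starts."""
    ledger.append({"lease_id": lease_id, "digest": circuit_digest(gates)})


def audit(ledger: List[Dict[str, str]], lease_id: str, claimed_gates: List[List[str]]) -> bool:
    """Third-party side: check that the claimed circuit matches the recorded digest."""
    expected = circuit_digest(claimed_gates)
    return any(
        entry["lease_id"] == lease_id and entry["digest"] == expected
        for entry in ledger
    )


ledger: List[Dict[str, str]] = []  # stand-in for an append-only blockchain
agreed = [["s", "add", "x", "y"], ["out", "mul", "s", "w"]]
record_lease(ledger, "lease-001", agreed)

print(audit(ledger, "lease-001", agreed))  # True
tampered = agreed + [["extra", "mul", "out", "out"]]
print(audit(ledger, "lease-001", tampered))  # False
```

Because the digest is recorded before the computation, neither party can later claim that a different circuit, and therefore a different type or number of operations, was agreed.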

4 Conclusion

Based on the privacy protection laws and regulations published to date, this paper proposes a new method of data circulation, data leasing; analyzes five requirements that data leasing should meet; and proposes a data leasing technology based on secret sharing, with the aim of further promoting the circulation of data and the formation of data value. In the future, enabling the data lessee to try out the data lessor's data before the lease begins may become the next direction for data leasing technology, which calls for more in-depth exploration by researchers.

About the Author

Ruan Wenqiang (1999-), male, PhD student at the School of Computer Science and Technology, Fudan University. His main research directions are privacy-preserving machine learning and differential privacy based on secure multi-party computation.

Xu Mingxin (1997-), male, master's student at the School of Software, Fudan University. His main research directions are privacy-preserving machine learning and differential privacy based on secure multi-party computation.

Tu Xinyu (1999-), male, master's student at the School of Software, Fudan University. His main research directions are privacy-preserving machine learning and secret sharing based on secure multi-party computation.

Song Lushan (1999-), female, Ph.D. student at the School of Computer Science and Technology, Fudan University. Her main research directions are privacy protection and machine learning based on secure multi-party computation.

Han Weili (1975-), male, Ph.D., professor at the School of Computer Science and Technology, Fudan University. His main research direction is data security and access control.
