Privacy Computing + AI Engineering Technology Practice Guide--Overall Introduction

Topic introduction: Rosetta, a privacy AI framework built on TensorFlow, was recently open-sourced. With it, AI developers do not need to understand privacy-protection technologies such as cryptography: changing only a few lines of code gives their programs the ability to protect data privacy. This column will use several exclusive, first-published technical articles to disclose in depth Rosetta's overall framework design, best practices for deeply customizing TensorFlow, and the efficient engineering implementation of cryptographic algorithms. Through this series, we hope more developers will come to understand the technical challenges of building a privacy AI framework, and that it will serve as a practical reference for developers working on the engineering of cryptographic protocols and the deep customization of AI frameworks.

It has become an industry consensus that data is the "fuel" of AI: more data usually means more accurate models. But whether within one company or across several, being responsible for user data and complying with laws and regulations requires protecting the original plaintext data whenever it is shared and used. Traditional security measures protect static data; they cannot address the privacy leakage that arises while data is dynamically used and shared. It is exactly this practical demand that gave rise to privacy-preserving computation (in AI scenarios, further called privacy AI), a new interdisciplinary technology that is integrated into the process of data use and ensures that the computation itself (and, in a broad sense, its results) leaks no information about the original plaintext data.

Current approaches to privacy-preserving computation fall into several categories: cryptography, federated learning, and hardware trusted execution environments (TEE). Among them, MPC (secure Multi-Party Computation), grounded in cryptographic theory, is the most rigorous route: its basic philosophy is to trust computational-complexity theory and code rather than people and hardware. Federated learning and TEEs, by contrast, still struggle to articulate their security guarantees clearly, and new vulnerabilities in them are discovered regularly; moreover, the core of federated learning often relies on cryptographic methods such as homomorphic encryption to achieve strong security. From an engineering perspective, federated learning is an extension of distributed machine learning: its main challenge is synchronizing model updates across many heterogeneous endpoints during training [1], so much of the accumulated experience in building distributed systems still applies. The cryptographic route represented by MPC, however, brings a set of genuinely new challenges.


MPC (picture from the Internet)
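To make the MPC idea concrete, here is a minimal sketch (not Rosetta's actual protocol) of three-party additive secret sharing over the ring Z_2^64: each private input is split into random shares, and no single party's share reveals anything about the input unless all parties collude.

```python
# Minimal additive secret sharing sketch for intuition only; real MPC
# protocols such as SecureNN build far richer operations on this idea.
import secrets

RING = 2 ** 64

def share(x, n_parties=3):
    """Split integer x into n additive shares that sum to x mod RING."""
    shares = [secrets.randbelow(RING) for _ in range(n_parties - 1)]
    shares.append((x - sum(shares)) % RING)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares mod RING."""
    return sum(shares) % RING

# Addition is "free": each party adds its local shares, no communication.
a_shares = share(20)
b_shares = share(22)
c_shares = [(a + b) % RING for a, b in zip(a_shares, b_shares)]
assert reconstruct(c_shares) == 42
```

Multiplication and comparisons, by contrast, require interaction among the parties, which is where most of the protocol and engineering complexity lives.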

The core difficulty is that cryptography is a branch of theoretical computer science: many of its concepts and protocols require long professional training to understand. Meanwhile the AI directions typically deployed in business, whether computer vision, text mining, or user-behavior modeling, are oriented toward concrete application scenarios. How to break down the barrier between privacy-protection technology (represented by cryptography) and AI technology is the core problem developers must solve when building a general-purpose, easy-to-use privacy computing framework. Around this core problem sits a series of concrete engineering and technical challenges:

  • **How to make the system easy to use?** AI developers are not willing, and should not have to, spend time and effort learning complex, abstract cryptographic algorithms just to add data-privacy protection to their business. A good privacy AI framework should let AI developers solve their data-privacy problems quickly, using methods they already know.
  • **How to make execution efficient?** This spans both the single-machine and multi-machine levels. Most cryptographic computation operates on ciphertexts that are large random numbers, so dedicated hardware instructions and techniques such as SIMD (Single Instruction, Multiple Data) are often needed to parallelize single-machine execution. These optimizations require a deep understanding of the underlying cryptographic libraries, and often further protocol-specific parallel tuning. At the multi-machine level, the framework must remain compatible with the parallel-optimization techniques of mainstream AI frameworks.
  • **How to make communication among the MPC parties efficient?** MPC requires a large amount of synchronous communication among the parties, and most of what travels over the channel is irregular, incompressible, single-use random numbers. The computation logic therefore needs many engineering optimizations that reduce both the volume of communication and the number of communication rounds while preserving security.
  • **How to keep the privacy-protection technology extensible?** Privacy computing technologies such as MPC are still evolving rapidly and remain hot topics in academic research, so a good privacy AI framework must let researchers integrate new algorithms and protocols easily and quickly.

The industry has already made some explorations in addressing these problems. Below, we use Rosetta to discuss concretely how these challenges can be overcome in the design and implementation of a privacy AI framework. Due to space limits, this article focuses on the overall macro design; subsequent articles in the series will analyze the technical details further.

Like other privacy AI frameworks, Rosetta is still in an early stage of development and has imperfections. We use it here as a concrete example to clarify the detailed challenges in this field and, hopefully, to inspire more developers to take part in designing future privacy AI systems.

Overall design ideas of a privacy AI framework

At present, no mature, complete privacy AI framework has been deployed at scale, but several exploratory open-source privacy AI frameworks already exist, such as PySyft, TF Encrypted, and CrypTen.

On the whole, these frameworks encapsulate and integrate at the front-end Python layer of TensorFlow or PyTorch. The advantage is that privacy computing algorithms can be implemented directly with the upper-layer interfaces of these AI frameworks, and the high-level API functions the frameworks already encapsulate can naturally be called directly. This fits federated learning well, a technology that itself grew out of distributed machine learning, but it has shortcomings for cryptographic MPC:

  • First, single-machine performance cannot be fully exploited. Implementing complex cryptographic computation and multi-party communication in Python cannot take full advantage of the parallel optimizations available at the operating-system and hardware layers. More practically, most high-performance cryptographic libraries expose C/C++ interfaces; implementing MPC and similar techniques in the front-end layer of the AI framework makes it hard to reuse the results the industry has accumulated over a long period (and is still developing).
  • Second, the implementation of cryptographic protocols becomes too deeply coupled with the AI framework itself, which hurts extensibility. The external APIs of these frameworks were designed for AI workloads, so implementing a complex cryptographic protocol such as MPC requires not only mastering those APIs but often also a large amount of numpy code for the complex computation logic. On one hand, this breaks the self-consistency of the AI framework: the computation logic can no longer be carried entirely on the framework's logical execution graph. On the other hand, every new back-end cryptographic protocol must be re-implemented from scratch against the AI framework, which is very costly for cryptographic-protocol developers.

Based on this understanding, at the current stage Rosetta starts from TensorFlow, a popular AI framework, deeply transforming both its front-end Python entry point and its back-end kernel implementation, and encapsulates pluggable MPC protocols as a "privacy-protection engine" that drives the secure flow of data throughout the computation.

Why TensorFlow?

TensorFlow and PyTorch are currently the most mainstream open-source AI frameworks in industry. Although many companies customize individual components to their own needs, or launch new frameworks that push further along dimensions such as ease of use, efficiency, and completeness, the basic design paradigms of these frameworks are broadly similar: rich interface APIs let users express upper-layer computation logic as a directed acyclic graph (DAG), and the framework applies a series of optimizations when it actually schedules and executes those computations. Although TensorFlow is somewhat less user-friendly than PyTorch and is often criticized by developers for it, it is more balanced and comprehensive in extensibility, efficiency, and distributed deployment (which also means TensorFlow is more complex, and transforming it is more challenging). After weighing these factors, Rosetta chose TensorFlow as the underlying carrier for its current version. During design and development, Rosetta makes full use of TensorFlow's built-in capabilities such as parallel graph-execution optimization to improve efficiency, while deliberately restraining itself to the interface characteristics TensorFlow shares with deep-learning frameworks in general, avoiding over-reliance on its unique components.

Rosetta framework core design ideas

  • **Privacy operator (SecureOp)** as the core abstract interface connecting the AI framework with privacy computing technology. TensorFlow offers extension points at several levels; Rosetta chooses the back-end operator (Operation) layer as its core entry point. When an operator executes, TensorFlow dynamically binds it to the SecureOp implementation of the specific MPC protocol. With this abstraction, cryptographic-protocol developers need not understand the AI framework; they only implement, in C++, functions that satisfy the interface definition. AI developers, in turn, need no deep knowledge of MPC internals; they can build the higher-level functionality they want on top of the existing operators.

  • Staged conversion based on **optimization passes (Pass)**. To give AI developers the easiest possible interface and minimize the cost of adding data-privacy protection to existing AI programs, Rosetta borrows a core concept from the field of compilers: the Pass. In compilers, passes perform multiple rounds of transformation and optimization while source code is lowered step by step into machine code. In Rosetta, the DAG (Directed Acyclic Graph) logical computation graph that the user builds with the native TensorFlow interface is converted in stages into an MPC program executed collaboratively by multiple parties, which keeps changes at the user-API layer to a minimum. Concretely, Rosetta has two kinds of passes. The first is a Static Pass that takes effect while the front-end Python layer builds the global DAG: it converts native Tensors into RttTensors that support the custom ciphertext type, converts native Operations into RttOps that accept tf.string-format inputs and outputs, and finally, when the graph starts, converts these into the SecureOps that host the actual MPC operations.
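The Static Pass idea can be sketched on a toy graph: walk the DAG and replace native op nodes with "Rtt"-prefixed privacy-aware equivalents, mirroring the Tensor-to-RttTensor and Operation-to-RttOp conversion described above. The node format below is invented purely for illustration.

```python
# Illustrative Static-Pass-style rewrite over a toy DAG; not Rosetta's
# real graph representation.
def static_pass(dag):
    """Return a new DAG with every native op renamed to its Rtt form."""
    return [
        {**node, "op": "Rtt" + node["op"]}
        if not node["op"].startswith("Rtt") else node
        for node in dag
    ]

graph = [
    {"op": "MatMul", "inputs": ["x", "w"]},
    {"op": "Add", "inputs": ["matmul_out", "b"]},
]
rewritten = static_pass(graph)
# rewritten ops are now "RttMatMul" and "RttAdd"
```

A real pass would also rewrite tensor dtypes and edge types, but the core mechanism is the same graph-walk-and-replace shown here.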


The second is the Dynamic Pass applied to SecureOps during actual execution: based on the protocol the user has selected, the corresponding concrete operator implementation is chosen dynamically for execution, and optimizations based on the execution context can be embedded at this point as well.
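The dynamic binding can be pictured as a registry mapping (op name, protocol name) to a concrete kernel, so the same graph node resolves to different implementations depending on which protocol the user activated, and a new protocol only has to register its own kernels. All names below are illustrative, not Rosetta's real API.

```python
# Hypothetical SecureOp-style dynamic dispatch sketch.
KERNELS = {}

def register_kernel(op_name, protocol):
    """Decorator registering a protocol-specific kernel for a graph op."""
    def wrap(fn):
        KERNELS[(op_name, protocol)] = fn
        return fn
    return wrap

@register_kernel("Add", "SecureNN")
def securenn_add(x_shares, y_shares):
    # Share-wise local addition over the ring; a real kernel would also
    # follow the protocol's communication steps with the other parties.
    return [(a + b) % (2 ** 64) for a, b in zip(x_shares, y_shares)]

def dispatch(op_name, protocol, *args):
    """Dynamically bind the abstract op to the active protocol's kernel."""
    return KERNELS[(op_name, protocol)](*args)

# dispatch("Add", "SecureNN", [1, 2], [3, 4]) -> [4, 6]
```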


Distributed privacy AI architecture integrating MPC technology

Understanding the overall distributed structure is very helpful for understanding a system's architecture. The external interface of the whole privacy AI system involves three questions: How is the network topology specified in physical deployment? How is data securely input, propagated, and output throughout the computation? How is the privacy computation logic expressed? The overall logical structure of Rosetta is shown in the figure below:

Rosetta multi-party network structure diagram

Establishment of multi-party network

MPC inherently requires multiple participants, generally called Players, and different MPC protocols involve different numbers of them. Taking SecureNN [2], the three-party protocol currently implemented in Rosetta, as an example, the system has three logical parties: P0, P1, and P2.

In version v0.2.1, to keep the user-facing interface flexible, a user can specify the network relationships among the machines once through a configuration file, and can also activate, and later tear down, the network topology among the parties dynamically through call interfaces:

# Calling the activate interface establishes the network from the configuration parameters or configuration file
rtt.activate(protocol_name="SecureNN", protocol_config_str=None)

# Calling the deactivate interface releases the network links and other resources
rtt.deactivate()
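For intuition, a three-party network configuration in the spirit of the description above might look like the following. The field names are illustrative, not Rosetta's actual configuration schema: the point is that every party knows the full topology, so it can listen on its own port and connect out to its peers.

```python
# Hypothetical three-party topology config; field names are invented.
import json

config = {
    "PARTIES": [
        {"NAME": "P0", "HOST": "127.0.0.1", "PORT": 11121},
        {"NAME": "P1", "HOST": "127.0.0.1", "PORT": 11122},
        {"NAME": "P2", "HOST": "127.0.0.1", "PORT": 11123},
    ]
}
protocol_config_str = json.dumps(config)
# This string could then be handed to an activate-style call, e.g.
# rtt.activate(protocol_name="SecureNN",
#              protocol_config_str=protocol_config_str)
```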

In the internal implementation, each participant listens on a local server port and at the same time establishes client links to the other two parties. The advantage is that the link relationships among the parties are simple and clear; the cost is the resulting consistency problems during concurrent and synchronous SecureOp execution, which we will also discuss in a later article.

Some points to note

  • Readers familiar with TensorFlow may wonder: isn't this mode, where multiple parties run the same program on different data, just the In-graph replication and Between-graph replication that TensorFlow's distributed execution uses for data parallelism? It is not; they are structures at different levels. Here we describe the MPC participants from the perspective of upper-layer logic. In practice, the task each party executes internally can even be deployed as a cluster following TensorFlow's distributed specification, with one "server" in the cluster acting as that party's unified external representative.

  • We have been speaking of three "logical" parties. In real business scenarios the data cooperation may involve 2, 4, or more companies; does that mean this architecture cannot be used? Not at all: a mapping layer can be added on top, offering the capability upward in the form of Privacy-as-a-Service. This will be introduced further in subsequent articles.

The flow of private data

Each logical participant can hold its own private plaintext input data, and can also continue processing ciphertext results output by a previous task. During the entire run of a program, data exists in plaintext only at the beginning and the end: private data is introduced at the start, and at the end it is configurable whether the results are recovered to plaintext and output. During the computation of every intermediate operator, data is exchanged only in ciphertext form, both within the local logical context and among the parties.

Regarding external interfaces: in real business, the data of multiple parties must be associated and aligned. Rosetta currently provides two common dataset-processing modes. In the first, the overall dataset is "horizontally partitioned" among the parties: every party holds all feature attributes, but for different sample IDs. The second corresponds to the dataset being "vertically partitioned": the parties hold different subsets of the feature attributes for the same sample IDs. Both can be handled conveniently by calling the load_data interface of the PrivateDataset class. For the output stage, the following two interfaces are provided:

# Recover a ciphertext cipher_tensor to plaintext; the receive_party parameter specifies which of the 3 parties obtain the plaintext result
rtt.SecureReveal(cipher_tensor, receive_party=0b111)

# Same function prototype as SaveV2, the model-saving interface in native TensorFlow; a configuration file specifies which parties obtain the plaintext model file
rtt.SecureSaveV2(prefix, tensor_names, shape_and_slices, tensors, name=None)
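The two partitioning modes can be illustrated on a toy dataset. The helper structures below are invented for illustration; in Rosetta this alignment is handled by PrivateDataset's load_data interface.

```python
# Horizontal split: both parties hold all columns, but different rows.
party_a_rows = [("id1", 0.3, 1.2), ("id2", 0.5, 0.9)]
party_b_rows = [("id3", 0.1, 2.0)]
horizontal = party_a_rows + party_b_rows          # union of samples

# Vertical split: both parties hold the same rows, but different columns.
party_a_cols = {"id1": {"age": 30}, "id2": {"age": 41}}
party_b_cols = {"id1": {"income": 5.0}, "id2": {"income": 7.5}}
vertical = {
    k: {**party_a_cols[k], **party_b_cols[k]}      # join on sample ID
    for k in party_a_cols.keys() & party_b_cols.keys()
}
```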

Private Set Intersection (PSI) technology

In real scenarios there remains one very practical problem: aligning samples across parties, for example matching the sample behind Party A's sample ID with the attribute information behind Party B's sample ID. PSI technology can solve this securely. This capability has not yet been well integrated into the various open-source frameworks; Rosetta is integrating it now and will release it in an upcoming version.
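To convey the flavor of PSI, here is a sketch of the classic Diffie-Hellman-style intersection protocol, for intuition only: each party blinds the hash of its elements with a secret exponent, and the double-blinded values can be compared without revealing non-intersecting items. The parameters below are far too small for real security, and deployed PSI systems use vetted protocols and parameters.

```python
# Toy DH-style PSI sketch; NOT secure parameters, illustration only.
import hashlib
import secrets

P = 2 ** 127 - 1  # a Mersenne prime; real systems use vetted groups

def h(item):
    """Hash an item into the multiplicative group mod P."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

def psi(set_a, set_b):
    a_key = secrets.randbelow(P - 2) + 1   # Party A's secret exponent
    b_key = secrets.randbelow(P - 2) + 1   # Party B's secret exponent
    # A sends h(x)^a to B, who returns (h(x)^a)^b; symmetrically for B.
    a_double = {pow(pow(h(x), a_key, P), b_key, P): x for x in set_a}
    b_double = {pow(pow(h(y), b_key, P), a_key, P) for y in set_b}
    # Double-blinded values match exactly on the intersection.
    return {x for k, x in a_double.items() if k in b_double}

# psi({"alice", "bob", "carol"}, {"bob", "dave"}) -> {"bob"}
```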

Internally, many cryptographic operations are operations over abstract algebraic structures with large spaces, such as rings, fields, and lattices, yet in code they are realized as processing of concrete data structures such as big integers and polynomials. The framework design therefore has to strike a balance among three goals:

  1. Keep the internal ciphertext data structures as transparent to the user as possible;
  2. Seamlessly preserve TensorFlow's core functions such as DAG construction and automatic differentiation;
  3. Allow different MPC protocols to use their own custom data-structure objects, for easy extension.

To achieve all of these at once, Rosetta uses tf.string, a native TensorFlow data structure, to carry each protocol's custom ciphertext data, and then applies deep hook-based modifications to the TensorFlow source code so that functions such as DAG construction and automatic differentiation remain available.
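The idea of carrying opaque ciphertext in a string-typed tensor can be sketched as follows: each share is serialized to bytes, and the framework only ever sees opaque strings. The encoding here is invented for illustration; each protocol in Rosetta defines its own serialization.

```python
# Minimal sketch of packing ring elements into opaque byte strings, the
# way a tf.string tensor can carry protocol-specific ciphertext.
def encode_share(share: int) -> bytes:
    """Serialize a 64-bit ring element into an opaque byte string."""
    return int(share).to_bytes(8, "little")

def decode_share(blob: bytes) -> int:
    """Recover the ring element from its byte-string form."""
    return int.from_bytes(blob, "little")

# A "ciphertext tensor": a container of byte strings instead of numbers.
shares = [17, 2 ** 63 + 5]
cipher_tensor = [encode_share(s) for s in shares]
decoded = [decode_share(b) for b in cipher_tensor]
assert decoded == shares
```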

Execution of DAG

As the network-structure diagram above shows, every Player runs the same TensorFlow-based AI binary, for example a program training a simple neural-network model. Users build the logical computation graph (DAG) directly with TensorFlow's native operator APIs; when the graph starts executing, Rosetta internally completes the conversion to privacy operators (SecureOps). Compared with other privacy computing frameworks, this switching cost is the lowest.

During execution, each Player runs along this same DAG. What is special is that inside each operator, each Player performs different operations according to its role, following the MPC protocol. These operations include local processing of ciphertext as well as strongly synchronized communication among the parties, transmitting ciphertext in the form of large quantities of random numbers.
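One standard building block behind such per-operator protocols is secure multiplication via Beaver triples. SecureNN uses related but more elaborate techniques; the generic two-party version below is a hedged sketch for intuition only, with the "dealer" and both parties simulated in one process.

```python
# Beaver-triple multiplication sketch over Z_2^64; illustration only.
import secrets

RING = 2 ** 64

def share2(x):
    """Split x into two additive shares mod RING."""
    r = secrets.randbelow(RING)
    return r, (x - r) % RING

def beaver_mul(x, y):
    """Return two shares of x*y, computed the Beaver-triple way."""
    # A dealer prepares shares of a random triple (a, b, c) with c = a*b.
    a, b = secrets.randbelow(RING), secrets.randbelow(RING)
    c = (a * b) % RING
    x0, x1 = share2(x); a0, a1 = share2(a)
    y0, y1 = share2(y); b0, b1 = share2(b)
    c0, c1 = share2(c)
    # The parties open d = x - a and e = y - b (safe: a, b are uniform
    # random masks, so d and e reveal nothing about x and y).
    d = (x0 - a0 + x1 - a1) % RING
    e = (y0 - b0 + y1 - b1) % RING
    # Each party computes z_i = c_i + d*b_i + e*a_i, and one party adds d*e.
    z0 = (c0 + d * b0 + e * a0 + d * e) % RING
    z1 = (c1 + d * b1 + e * a1) % RING
    return z0, z1

z0, z1 = beaver_mul(6, 7)
assert (z0 + z1) % RING == 42
```

The single round of opening d and e is exactly the kind of strongly synchronized, random-looking traffic described above.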


Summary

In this article we used the Rosetta framework to introduce the challenges a privacy AI framework faces when engineered in practice, together with some of the design solutions adopted by frameworks such as Rosetta. Subsequent articles will introduce the core modules in more depth. The Rosetta framework is open-sourced on GitHub (https://github.com/LatticeX-Foundation/Rosetta); you are welcome to follow the project.

References

  1. Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., … & Van Overveldt, T. (2019). Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046.

  2. Wagh, S., Gupta, D., & Chandran, N. (2018). SecureNN: Efficient and Private Neural Network Training. IACR Cryptol. ePrint Arch., 2018, 442.



Origin blog.csdn.net/Matrix_element/article/details/108748065