RPC Talk: The Serialization Problem

What is serialization

To a computer, all data is a sequence of bits. To work with that binary data in a human-readable and controllable form, programmers invented data types and data structures. A data type specifies how a piece of binary data is parsed, and a data structure specifies how multiple pieces of binary data (contiguous or not) are organized.
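To illustrate how a data type is just a parsing rule, here is a minimal sketch that reads the same four bytes once as an integer and once as a floating-point number. The helper names asUint32 and asFloat32 are made up for this example, and little-endian byte order is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// asUint32 interprets 4 bytes as a little-endian unsigned integer.
func asUint32(b []byte) uint32 {
	return binary.LittleEndian.Uint32(b)
}

// asFloat32 interprets the same 4 bytes as an IEEE 754 single-precision float.
func asFloat32(b []byte) float32 {
	return math.Float32frombits(binary.LittleEndian.Uint32(b))
}

func main() {
	raw := []byte{0x00, 0x00, 0x80, 0x3f}

	fmt.Println(asUint32(raw))  // 1065353216
	fmt.Println(asFloat32(raw)) // 1
}
```

The bytes never change; only the type chosen to parse them does.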

For example, consider the following Go struct:

type User struct {
	Name  string
	Email string
}

Name and Email each refer to a separate region of memory (the two regions may or may not be contiguous), and the struct variable itself also has a memory address.

Within a single process, we can exchange data by sharing the struct's address. But to send the data over the network to a process on another machine, we must encode the separate memory regions of the User object into one contiguous binary representation; this is called "serialization". When the peer machine receives the binary stream, it must recognize the data as a User object and parse it into the program's internal representation; this is called "deserialization".
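As a rough sketch of that round trip, the code below flattens a User into one contiguous byte slice and parses it back. The format (a 4-byte big-endian length before each string) and the names Serialize, Deserialize, appendString, and readString are assumptions made for this illustration, not any real protocol.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

type User struct {
	Name  string
	Email string
}

// appendString writes a 4-byte big-endian length followed by the string bytes.
func appendString(buf []byte, s string) []byte {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(s)))
	buf = append(buf, n[:]...)
	return append(buf, s...)
}

// readString reads one length-prefixed string and returns the remaining bytes.
func readString(buf []byte) (string, []byte, error) {
	if len(buf) < 4 {
		return "", nil, errors.New("short buffer: missing length prefix")
	}
	n := binary.BigEndian.Uint32(buf)
	buf = buf[4:]
	if uint64(len(buf)) < uint64(n) {
		return "", nil, errors.New("short buffer: truncated string")
	}
	return string(buf[:n]), buf[n:], nil
}

// Serialize copies the separate Name and Email memory into one byte slice.
func Serialize(u User) []byte {
	buf := appendString(nil, u.Name)
	return appendString(buf, u.Email)
}

// Deserialize rebuilds a User from the contiguous representation.
func Deserialize(buf []byte) (User, error) {
	name, rest, err := readString(buf)
	if err != nil {
		return User{}, err
	}
	email, _, err := readString(rest)
	if err != nil {
		return User{}, err
	}
	return User{Name: name, Email: email}, nil
}

func main() {
	data := Serialize(User{Name: "Ann", Email: "ann@example.com"})
	fmt.Printf("% x\n", data)

	u, err := Deserialize(data)
	fmt.Println(u, err)
}
```

Both sides must agree on the field order and the length encoding in advance; that shared agreement is what the interface description discussed next makes explicit.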

Serialization and deserialization convert the same data between the human perspective and the machine perspective.

Serialization process

Define the interface description (IDL)

To convey the description of the data, and to give collaborators a shared specification, we generally put that description in a definition file written in an IDL (Interface Description Language). For example, here is a Protobuf IDL definition:

message User {
  string name  = 1;
  string email = 2;
}
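The field numbers in such a definition are what actually travel on the wire; the field names do not. The sketch below hand-encodes a User message following the Protobuf wire format for length-delimited fields (a tag of field number << 3 | 2, then a varint length, then the bytes). The helpers appendStringField and encodeUser are hypothetical names for this illustration; a real project would use the code generated by protoc instead.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wireTypeLen is the Protobuf wire type for length-delimited values
// such as strings.
const wireTypeLen = 2

// appendStringField appends one string field: tag, varint length, bytes.
func appendStringField(buf []byte, fieldNum int, s string) []byte {
	var tmp [binary.MaxVarintLen64]byte

	n := binary.PutUvarint(tmp[:], uint64(fieldNum<<3|wireTypeLen))
	buf = append(buf, tmp[:n]...)

	n = binary.PutUvarint(tmp[:], uint64(len(s)))
	buf = append(buf, tmp[:n]...)

	return append(buf, s...)
}

// encodeUser writes name as field 1 and email as field 2, matching the
// numbers in the IDL. Empty strings are omitted, as proto3 does for
// default values.
func encodeUser(name, email string) []byte {
	var buf []byte
	if name != "" {
		buf = appendStringField(buf, 1, name)
	}
	if email != "" {
		buf = appendStringField(buf, 2, email)
	}
	return buf
}

func main() {
	fmt.Printf("% x\n", encodeUser("Al", "a@b"))
	// prints: 0a 02 41 6c 12 03 61 40 62
}
```

In the output, 0a is the tag for field 1 and 12 is the tag for field 2, which is why field numbers should stay stable once a definition is in use.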

Generate stub code

Whatever serialization method is used, the data must ultimately become an object in the program. Although a serialization format is usually language-independent, the step that binds the serialized data to the program's internal representation (such as a struct or class) is language-dependent, so many serialization libraries provide a compiler that turns IDL files into stub code in the target language.

Stub code generally consists of two parts:

Type structure generation (that is, a struct in Go or a class in Java)
Serialization/deserialization code generation (converting between the binary stream and the target-language structure)

The following is serialization stub code generated by Thrift:

type User struct {
	Name  string `thrift:"name,1" db:"name" json:"name"`
	Email string `thrift:"email,2" db:"email" json:"email"`
}

// Write serializes the User struct.
func (p *User) Write(oprot thrift.TProtocol) error {
	if err := oprot.WriteStructBegin("User"); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T write struct begin error: ", p), err)
	}
	if p != nil {
		if err := p.writeField1(oprot); err != nil {
			return err
		}
		if err := p.writeField2(oprot); err != nil {
			return err
		}
	}
	if err := oprot.WriteFieldStop(); err != nil {
		return thrift.PrependError("write field stop error: ", err)
	}
	if err := oprot.WriteStructEnd(); err != nil {
		return thrift.PrependError("write struct stop error: ", err)
	}
	return nil
}

// writeField1 writes the name field.
func (p *User) writeField1(oprot thrift.TProtocol) (err error) {
	if err := oprot.WriteFieldBegin("name", thrift.STRING, 1); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T write field begin error 1:name: ", p), err)
	}
	if err := oprot.WriteString(string(p.Name)); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T.name (1) field write error: ", p), err)
	}
	if err := oprot.WriteFieldEnd(); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T write field end error 1:name: ", p), err)
	}
	return err
}

// writeField2 writes the email field.
func (p *User) writeField2(oprot thrift.TProtocol) (err error) {
	if err := oprot.WriteFieldBegin("email", thrift.STRING, 2); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T write field begin error 2:email: ", p), err)
	}
	if err := oprot.WriteString(string(p.Email)); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T.email (2) field write error: ", p), err)
	}
	if err := oprot.WriteFieldEnd(); err != nil {
		return thrift.PrependError(fmt.Sprintf("%T write field end error 2:email: ", p), err)
	}
	return err
}

As you can see, to serialize the User object into binary, the stub hard-codes the layout and order of the whole structure and performs an explicit type conversion for each field. If we add a new field, we need to regenerate the stub code and ask all clients to update (although clients that do not need the new field can skip the update). Deserialization works similarly.
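Clients that do not need a new field can skip the update because each field carries its own ID and length, so an old reader can step over IDs it does not recognize. The sketch below shows the idea with a toy format (1-byte field ID, 1-byte length, value bytes). The format and the names appendField and DecodeUser are assumptions for this illustration, not Thrift's actual encoding.

```go
package main

import (
	"errors"
	"fmt"
)

type User struct {
	Name  string
	Email string
}

// appendField writes one field as: 1-byte ID, 1-byte length, value bytes.
// Values longer than 255 bytes are not supported by this toy format.
func appendField(buf []byte, id byte, value string) []byte {
	buf = append(buf, id, byte(len(value)))
	return append(buf, value...)
}

// DecodeUser is an "old" reader that only knows field IDs 1 and 2.
func DecodeUser(buf []byte) (User, error) {
	var u User
	for len(buf) > 0 {
		if len(buf) < 2 || len(buf)-2 < int(buf[1]) {
			return User{}, errors.New("truncated field")
		}
		id, n := buf[0], int(buf[1])
		value := string(buf[2 : 2+n])
		buf = buf[2+n:]

		switch id {
		case 1:
			u.Name = value
		case 2:
			u.Email = value
		default:
			// Unknown field from a newer writer: the length prefix
			// already let us step over it, so there is nothing to do.
		}
	}
	return u, nil
}

func main() {
	var msg []byte
	msg = appendField(msg, 1, "Ann")
	msg = appendField(msg, 2, "ann@example.com")
	msg = appendField(msg, 3, "555-0100") // hypothetical field added later

	u, err := DecodeUser(msg)
	fmt.Println(u, err) // {Ann ann@example.com} <nil>
}
```

The old reader still decodes the fields it knows, but it silently drops the new one, which is why clients that do need the new field must pick up the regenerated stub.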

The lengthy code above handles only the simplest message structure, chosen for demonstration. For real message types in a production environment, the stub code is more complicated.

Stub code generation exists only to solve the problem of cross-language calls; it is not mandatory. If the caller and callee are written in the same language, and are guaranteed to stay that way, you can also write the definition directly in the target language and skip the compilation step. For example, the Thrift-based drift project lets you write the definitions directly in Java:

@ThriftStruct
public class User
{
    private final String name;
    private final String email;

    @ThriftConstructor
    public User(String name, String email)
    {
        this.name = name;
        this.email = email;
    }

    @ThriftField(1)
    public String getName()
    {
        return name;
    }

    @ThriftField(2)
    public String getEmail()
    {
        return email;
    }
}

As noted earlier, serialization exists to convert data between the human view and the machine view of it. The traditional approach does this through an intermediate structure, whereas this approach does it by providing operation functions.

A common problem with this kind of approach, however, is that it only gives you functions for operating on the data, at the cost of the programmer's ability to manage the data directly. For example, if you want to know which fields the User structure has, you cannot tell from a string of bytes unless the generated serialization code exposes that capability. Likewise, if you want to store the User object in a database directly through an ORM tool, you must write a new User struct by hand and copy the values over one field at a time.
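For contrast, when the data lives in an ordinary struct, Go's standard reflect package can list its fields at runtime, which is the kind of capability an ORM relies on. The sketch below is an assumption-laden illustration: the User type carries db tags like the Thrift-generated struct shown earlier, and dbColumns is a hypothetical helper, not part of any ORM.

```go
package main

import (
	"fmt"
	"reflect"
)

type User struct {
	Name  string `db:"name"`
	Email string `db:"email"`
}

// dbColumns returns the `db` tag of every struct field that has one,
// in declaration order. It accepts a struct or a pointer to a struct.
func dbColumns(v interface{}) []string {
	t := reflect.TypeOf(v)
	if t == nil {
		return nil
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return nil
	}

	var cols []string
	for i := 0; i < t.NumField(); i++ {
		if tag, ok := t.Field(i).Tag.Lookup("db"); ok {
			cols = append(cols, tag)
		}
	}
	return cols
}

func main() {
	fmt.Println(dbColumns(User{})) // [name email]
}
```

With only an opaque byte buffer and accessor functions, there is no equivalent of this loop unless the framework ships its own schema or reflection API.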

Serialization frameworks of this type are mostly used in core infrastructure services whose data definitions rarely change, such as databases and message queues. For day-to-day business development, they may not be cost-effective.

Conclusion

People often debate online which serialization protocol performs best. But if we study the various serialization solutions carefully, we find that a serialization protocol itself is just a document, and its performance depends on how it is implemented. Implementations in different languages, and different implementation approaches within the same language, have a huge impact on the final usability and performance. You could even implement the Protobuf protocol in the style of FlatBuffers, which might improve performance considerably, but that may not be what you want.

More important than performance is figuring out which serialization problems we want to solve and which ones we are willing to give up. Only with clear requirements can we choose a suitable serialization solution and, when we do run into a problem, quickly tell whether it is solvable and how to solve it.

Origin blog.csdn.net/kalvin_y_liu/article/details/129995839