In-Depth Thinking About RPC Frameworks: Interview Series, Part Two

This post continues the previous one, though it can also be read independently. The series covers different aspects of RPC frameworks.

4 Serialization and deserialization (Alibaba interview)

4.1 Why serialize, and what problems does it solve, given that serialization and deserialization carry a performance cost?

Serialization is the process of converting an object's state into a form that can be stored or transmitted; deserialization converts data in that form back into an object. Both do consume some performance, but their raison d'être is to solve the following problems:

  1. Persistence: to permanently save an object's state to disk (e.g., a database or file), the object usually needs to be serialized. The information can later be read back from disk and restored into an object via deserialization.

  2. Remote calls (RPC): when making a remote method call in a distributed system, the object usually needs to be serialized into some format so it can be transmitted over the network. The receiver deserializes the data on arrival, restores the original object, and then processes it.

  3. Data exchange: when two different applications or systems need to exchange data, serialization provides a common, structured format (such as JSON or XML) to facilitate the exchange.

  4. Caching: when using an external cache like Redis, objects usually need to be serialized before they can be stored in the cache, and deserialized when retrieved.

  5. Deep copy: serialization followed by deserialization creates a completely independent copy of an object.

Although serialization and deserialization introduce some performance overhead, it is usually acceptable given the needs above. There are also strategies and best practices for reducing it, such as choosing an efficient serialization format and using speed-optimized libraries (Protocol Buffers, FlatBuffers, etc.).

4.1.1 Can persistence, RPC, data exchange, caching, and deep copy be done without serialization? Why or why not?

Not quite. Serialization provides a generic, structured, and efficient way to perform these operations. Without it, you can still accomplish them, but the process becomes complicated, inefficient, or unreliable. Let's analyze each case:

  1. Persistence: without serialization, you may need to design specific save and load logic for each object, for example a database table column per object property. This would work, but it gets very complicated, especially if the object's structure changes frequently.

  2. Remote calls (RPC): without serialization, you would need to define specialized encoding and decoding logic for every object transmitted over the network. This not only adds complexity but can also introduce errors and lead to inefficient network communication.

  3. Data exchange: without serialization, data exchange between systems written in different languages may require manually defining and following a specific format. This tightly couples the two systems, reducing flexibility and maintainability.

  4. Caching: external caching systems such as Redis can usually only store basic data types such as strings or byte arrays. So to store an object in Redis, you must convert it into these basic types, which is essentially serialization.

  5. Deep copy: without serialization, you may need to implement deep-copy logic for each object by hand, which is not only time-consuming but also error-prone.

In summary, while these operations can still be done without serialization, serialization provides us with a structured, efficient, and reliable way to handle these common tasks.

(1) The simple root cause: objects, files, and data come in many different formats, making uniform network transmission and persistent storage difficult. Serialization provides a unified storage and transmission format, so that every node of a distributed system can access the serialized data in a uniform way.

Suppose you have a Person object containing a name and an age. Without serialization, you might need to write code that extracts the name and age separately and then stores or sends them in a specific format (such as CSV or JSON). When needed, you would write more code to parse that format and use the results to create a new Person object. This requires a lot of code, and if the object's structure changes (say, a new field is added), you have to modify that code. If the structure changes frequently, so do the code changes.
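The hand-rolled approach described above can be sketched in Java. The CSV layout, field names, and method names here are illustrative assumptions, not part of any framework:

```java
import java.util.Arrays;

class Person {
    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

public class ManualPersistenceDemo {
    // Without serialization: hand-written save logic, one field at a time.
    static String toCsv(Person p) {
        return p.name + "," + p.age;
    }

    // Hand-written load logic that must mirror toCsv exactly.
    static Person fromCsv(String line) {
        String[] parts = line.split(",");
        return new Person(parts[0], Integer.parseInt(parts[1]));
    }

    public static void main(String[] args) {
        Person p = new Person("Alice", 30);
        String stored = toCsv(p);          // "Alice,30"
        Person restored = fromCsv(stored);
        System.out.println(restored.name + " " + restored.age);
        // Adding a new field (e.g. an email) would force coordinated
        // changes to both toCsv and fromCsv -- the maintenance burden
        // that a serialization framework removes.
    }
}
```

Every structural change to `Person` ripples through both methods, which is exactly the fragility the paragraph describes.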

(2) Other benefits: some serialization protocols remain human-readable after decoding, and some compress data, which saves network bandwidth.

4.2 Relationship between serialization and communication protocols

The relationship between serialization and communication protocols is primarily that serialization provides the means to send and receive complex objects in network communication. In network communication, all data must eventually be converted into a stream of bytes before it can be sent over the network. Serialization is the process of this conversion, which converts the state of an object into a stream of bytes. The communication protocol defines how to send and receive these byte streams. So in many cases of network communication, serialization is part of the communication protocol. For example, in the HTTP protocol, we often use JSON or XML as a serialization method to send and receive data.
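The division of labor described above can be sketched in Java: the serializer turns the object into bytes, and a toy protocol layer frames those bytes for transport. The 4-byte length prefix is a hypothetical protocol chosen for illustration, not any standard:

```java
import java.io.*;

public class FramingDemo {
    // Serialization: object state -> byte stream.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Toy protocol layer: a 4-byte length prefix followed by the payload,
    // so the receiver knows where each message ends on the wire.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        return bos.toByteArray();
    }

    // Receiver: protocol reads the frame, deserialization rebuilds the object.
    static Object receive(byte[] message) throws IOException, ClassNotFoundException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(message));
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(payload))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = frame(serialize("hello rpc"));
        System.out.println(receive(wire)); // prints "hello rpc"
    }
}
```

The serializer and the framing layer are independent: you could swap Java serialization for JSON or Protocol Buffers without touching the protocol code, which is the relationship the section describes.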

4.3 Suppose a service's input parameter is an interface, and that interface has four implementation classes, each with different fields; what they have in common is that they all implement the same interface. In this scenario, which serialization method should your RPC framework use, and why?

Me: Can you tell me why this involves serialization?

Interviewer: Do you think JSON would work in this scenario? After all, you are serializing an interface, not a concrete implementation class.

Me: Could we add a field to the JSON that indicates which implementation class is expected?

Interviewer: But once you add fields, how do serialization and deserialization work? For example, say the object I originally serialized had only two fields and I later added several more. How does the receiving end find out about the changed fields?

Me: If you use Protocol Buffers, it supports custom fields, so the data can still be parsed smoothly.

Interviewer: Why is that? How can Protocol Buffers detect added or removed fields?

Me: Because Protocol Buffers serialization is self-describing: each field is stored as a (type, length, value) cell, so the fields can always be split apart correctly. For example, suppose there was originally only one field, "k1": "v1"; its stored form is (string,2,k1),(string,2,v1). After adding a new field "k10": "v10", the stored data becomes (string,2,k1),(string,2,v1),(string,3,k10),(string,3,v10).

GPT-4's correct answer: forward/backward compatibility. This means that older versions of the serialization code can parse data produced by newer versions (forward compatibility), and vice versa (backward compatibility). In Protocol Buffers this is achieved by assigning a unique numeric identifier to each field and keeping those identifiers stable; combined with the storage format described above, the length field lets a reader recover each new field's id, key, and value. That is why Protocol Buffers can cope with added or removed fields: fields can be renamed, added, or removed without breaking compatibility, as long as the identifiers stay the same.

Interviewer: JSON also supports metadata description, but it requires special configuration; if you don't enable it, every field is just treated as a string and parsed from the JSON individually. In the class case, the JSON can carry a field naming the class, i.e. its fully qualified name, and during deserialization the framework looks up the concrete implementation class by that name. What you described is that serialization carries self-describing metadata, so there is no need to reorganize the binary information inside the whole payload.
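The field-number mechanism discussed above can be sketched as a hypothetical `.proto` definition (the message and field names are made up for illustration):

```protobuf
syntax = "proto3";

message Person {
  // Each field has a permanent numeric tag. On the wire, a record is a
  // sequence of (tag, wire-type, payload) cells, and length-delimited
  // payloads carry their own length, so a reader that does not know a
  // tag can still skip its bytes and keep parsing the rest.
  string name = 1;   // original field, tag 1
  // int32 age = 2;  // removed field: tag 2 must never be reused
  string email = 3;  // added later; old readers simply skip tag 3
}
```

Renaming `email` would not break old data, because only the tag number 3 is written to the wire; reusing a retired tag for a different field, however, would corrupt compatibility.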

4.4 What is the difference between a deep copy implemented via serialization and one implemented by overriding Java's clone method?

There are several key differences between implementing a deep copy via serialization and via the clone method:

  1. Implementation:

    • Serialization: the object is converted into a byte stream, and a new object is then reconstructed from that byte stream.
    • clone method: the class must implement the Cloneable interface and override the clone method. Calling clone creates a new object and copies the original object's fields into it.
  2. Depth:

    • Serialization: automatically deep-copies the object and all its nested objects, with no extra work required.
    • clone method: the default clone is a shallow copy. To achieve a deep copy, you must manually call clone on each nested object, which can get quite complicated for objects with multiple levels of nesting.
  3. Performance:

    • Serialization: because of the conversion between objects and byte streams, deep copy via serialization has relatively high overhead.
    • clone method: generally performs better, because it operates directly in memory.
  4. Flexibility and security:

    • Serialization: you do not need to write copy logic for each object, but every object involved must implement the Serializable interface. In addition, serialization may expose an object's private fields, creating a security risk.
    • clone method: lets you customize the deep-copy logic for each object, but you must write that logic for each object, which adds implementation complexity.
  5. Exception handling:

    • Serialization: serialization and deserialization may throw exceptions such as IOException and ClassNotFoundException.
    • clone method: clone may throw CloneNotSupportedException, but only if the class does not implement the Cloneable interface.
  6. External resources:

    • Serialization: serialization does not copy an object's external resources, such as file handles or database connections.
    • clone method: likewise, clone does not copy external resources.

Overall, implementing deep copy via serialization is simple and automatic but may sacrifice performance, while implementing it via clone is more efficient but requires more manual work and maintenance. Which to choose depends on the specific scenario and requirements.
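A minimal sketch of the serialization-based approach, assuming the whole object graph implements Serializable (class names here are illustrative):

```java
import java.io.*;

public class DeepCopyDemo {
    // Serialization-based deep copy: write the object graph to a byte
    // array, then read it back as a fresh, independent graph.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T obj)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);                 // object graph -> bytes
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (T) ois.readObject();          // bytes -> new objects
        }
    }

    static class Address implements Serializable {
        String city;
        Address(String city) { this.city = city; }
    }

    static class Person implements Serializable {
        String name;
        Address address;                          // nested object, copied automatically
        Person(String name, Address address) {
            this.name = name;
            this.address = address;
        }
    }

    public static void main(String[] args) throws Exception {
        Person original = new Person("Alice", new Address("Hangzhou"));
        Person copy = deepCopy(original);
        copy.address.city = "Shanghai";           // mutate only the copy
        System.out.println(original.address.city); // prints "Hangzhou"
    }
}
```

Note how the nested `Address` is copied with no extra code, which is point 2 above; a clone-based version would have to clone `Address` explicitly.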

5 The JDK's serialization interface, Serializable (UF interview)

5.1 What is this interface used for?

Serializable is a marker interface (that is, an interface without any methods) provided by Java that tells the Java Virtual Machine (JVM) that objects of a class can be serialized. Serialization is the process of converting an object's state information into a form that can be stored or transmitted. This form is usually a byte stream, but other formats are possible. When an object is serialized, it can be written to a file, sent over the network, etc., and then deserialized back to an object when appropriate.

Serialization is mainly used in the following situations:

  1. Persistent data: The state of an object can be stored on disk and restored later.
  2. Remote Method Invocation (RMI): Objects can be transferred over the network and then deserialized on the other side for use on the remote system.
  3. Distributed Computing: Objects can be transferred between different nodes in a cluster or grid.

In a nutshell, the Serializable interface enables an object of a class to be converted into a stream of bytes, which can be stored, transferred, and later converted back to the original object state.

5.2 In Spring we often serialize object instances to JSON; in that case there is no need to implement this interface, right?

Yes, you are right. In Spring or many other frameworks, when we serialize an object instance to JSON, we usually don't need that object to implement the Serializable interface. This is because JSON serialization typically accesses object fields through reflection and converts them to JSON strings, rather than through Java's built-in serialization mechanism.

The Serializable interface is primarily used with Java's native serialization mechanism, especially when you need to save object state to a file or send objects over the network between Java applications. This mechanism involves converting the object into a specific byte stream format that can then be deserialized back to the object.

But in modern applications, JSON has become a very popular data exchange format, especially in web applications and RESTful services. In order to serialize and deserialize JSON, we usually use libraries such as Jackson, Gson, etc. These libraries do not require objects to implement the Serializable interface.
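To illustrate the reflection point, here is a toy field-to-JSON converter in plain Java. It is a deliberate simplification, not how Jackson or Gson are actually implemented, and note that Person does not implement Serializable:

```java
import java.lang.reflect.Field;

public class ReflectionJsonDemo {
    // No Serializable anywhere: JSON libraries don't need it.
    static class Person {
        String name = "John";
        int age = 25;
    }

    // Toy JSON writer: reads every declared field via reflection.
    static String toJson(Object obj) throws IllegalAccessException {
        StringBuilder sb = new StringBuilder("{");
        Field[] fields = obj.getClass().getDeclaredFields();
        for (int i = 0; i < fields.length; i++) {
            fields[i].setAccessible(true);        // read even private fields
            Object value = fields[i].get(obj);
            sb.append('"').append(fields[i].getName()).append("\":");
            if (value instanceof String) {
                sb.append('"').append(value).append('"');
            } else {
                sb.append(value);                 // numbers, booleans, null
            }
            if (i < fields.length - 1) sb.append(',');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toJson(new Person()));
    }
}
```

The key point is that the converter only needs reflection over the class's fields; Java's built-in serialization mechanism, and therefore Serializable, is never involved.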

5.3 How to use this serialization interface

When you want to serialize an object through Java's native serialization mechanism, the object's class must implement the Serializable interface. This is a marker interface, which means it doesn't have any methods to implement, but it tells the JVM that this object is serializable.

The following simple Java example shows how to serialize and deserialize using the Serializable interface:

import java.io.*;

class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public String toString() {
        return "Person [name=" + name + ", age=" + age + "]";
    }
}

public class SerializationDemo {
    public static void main(String[] args) {
        // Serialize the object
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("person.ser"))) {
            Person person = new Person("John", 25);
            oos.writeObject(person);
            System.out.println("Person object has been serialized.");
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Deserialize the object
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream("person.ser"))) {
            Person deserializedPerson = (Person) ois.readObject();
            System.out.println("Deserialized Person: " + deserializedPerson);
        } catch (IOException | ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}

5.4 There is an ID associated with this interface. Do you know what it is for?

It is serialVersionUID, a private static final constant that identifies the serialized version of a class. Declaring it is optional, but it is recommended to always include it to ensure serialization compatibility.

Precautions:

  1. If the fields of the class change (for example, new fields are added), serialVersionUID may need to change too. If you do not set serialVersionUID and then change the class's structure, you may get an InvalidClassException when trying to deserialize an old object.

  2. Not all Java objects can be serialized: an object must be serializable, and so must every object it references. If an object contains a field that cannot be serialized, mark that field as transient so that it is skipped. Use ObjectOutputStream to serialize an object and write it to a file, and ObjectInputStream to read it back from a file and deserialize it.
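The transient behavior mentioned above can be sketched as follows (the class and field names are illustrative):

```java
import java.io.*;

public class TransientDemo {
    static class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        String user;
        transient String password;   // excluded from the serialized form
        Session(String user, String password) {
            this.user = user;
            this.password = password;
        }
    }

    // Serialize to a byte array and immediately deserialize it back.
    static Session roundTrip(Session s) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(s);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (Session) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Session restored = roundTrip(new Session("alice", "secret"));
        System.out.println(restored.user);      // prints "alice"
        System.out.println(restored.password);  // prints "null": transient skipped
    }
}
```

After the round trip the transient field comes back as its type's default value (null for objects), which is why transient is a common way to keep secrets or non-serializable handles out of the byte stream.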

5.4.1 Why is such a serialVersionUID field needed?

Answer: when the receiver deserializes the byte stream sent by the sender, it needs a matching class to map the data onto. Every field parsed from the byte stream must exist in that class, otherwise the data would be inconsistent. This means the class versions used by the sender and the receiver must match. For example, suppose the sender first sends the byte stream of a Person object that has only a name field, the receiver's Person class also has only a name field, and both sides use version number 1; the first deserialization succeeds. Then the sender adds an age field to Person and sets the version number to 2, but the receiver's Person does not add the field and stays at version 1. When the sender sends the byte stream a second time, the receiver will throw an exception.

5.4.2 So if the sender adds a new field and bumps the serialVersionUID, while the receiver's corresponding class adds the matching field and bumps its version too, then with both sides' fields and versions updated in sync, deserialization is guaranteed to succeed, right?

Yes, exactly. When you manage serialVersionUID manually and the sender's class structure changes (for example, fields are added or removed), you need to update serialVersionUID and ensure that the class definitions used by the sender and receiver stay synchronized. As long as the class definitions (including fields and serialVersionUID) on both sides are consistent, serialization and deserialization will succeed. Note, however, that synchronizing serialVersionUID alone is not enough: if the sender adds a field and updates the serialVersionUID while the receiver only updates the serialVersionUID without adding the corresponding field, the new field's data will be lost on the receiving side, and incompatible structural changes can still cause deserialization errors.
Therefore, manually managing serialVersionUID and keeping the classes synchronized both serve serialization compatibility. This works well when you have full control and can ensure both sides update in lockstep, but in large distributed systems more policies and tooling may be needed to manage this compatibility.

5.4.3 If the sender's and receiver's serialized classes can be updated synchronously and kept consistent, can serialVersionUID simply be omitted?

In theory, if you ensure that the class structure at both ends is always consistent, serialVersionUID can be omitted, because the JVM generates a default version number for classes that do not define one. This default is computed from the class's structure (member variables, methods, and so on).
In practice, however, doing so carries risks:

  1. Implicit behavior: omitting serialVersionUID means the JVM generates a default version number for your class. Whenever any detail of the class changes, that number changes too, which can cause unforeseen deserialization problems.

  2. Uncertainty across JVM versions: different JVM implementations (or versions) may produce different default serialVersionUID values, causing problems when serializing and deserializing across them.

  3. Fewer chances for error: setting serialVersionUID manually means you have considered version compatibility. Consciously changing or maintaining its value reduces unexpected serialization problems when the class structure changes.

So even if you can guarantee that the sender's and receiver's class structures are always consistent, explicitly defining serialVersionUID for your serializable classes is best practice for reducing potential serialization problems.

5.5 In a distributed environment, how do you generally manage the version of this object?

In a distributed environment, object version management becomes particularly important, because inter-service communication, data exchange, and state sharing are the foundations of distributed systems. Below are some recommendations and best practices to ensure object versions are properly managed in a distributed environment:

  1. Centralized repository : use a centralized artifact repository (such as Maven Central, Nexus, or Artifactory) to store and manage all JAR packages and libraries. This ensures that all services in a distributed environment reference the same library versions.

  2. Contract-Driven Design : In a microservices environment, you can use tools such as Spring Cloud Contract to define and verify interactions between services. This ensures a consistent interface and data format across services without requiring each service to be updated to the latest version.

  3. Use data schema management : For systems like Apache Kafka, Apache Avro, you can use Confluent Schema Registry or Apache Avro's built-in schema version control to manage changes in data structures.

  4. Backward Compatibility : Try to make new versions of objects backward compatible, so that even if the service versions are inconsistent, they can still interact normally.

  5. Version naming conventions : Follow a consistent version naming convention, such as Semantic Versioning, so that you can easily understand the nature of the change by the version number.

  6. Deprecation policy : If you need to remove or change a part of an object, provide a transition period and support the old version during that time. This gives other services enough time to make necessary adjustments.

  7. Service discovery and registration : Use service registration and discovery mechanisms (such as Eureka, Consul, etc.), so that services can know the versions of other services and make decisions accordingly.

  8. Monitoring and Alerting : Use monitoring tools to track version changes in a distributed environment. Warn immediately if inconsistent versions are detected.

  9. Canary and gradual rollouts : when introducing a new version of a service or object, do not deploy it to all instances immediately. Deploy to a small number of instances first to verify compatibility with other services, then gradually expand the rollout.

  10. Maintain Documentation : Documentation is continuously updated, documenting changes for each version and differences between versions.

In a distributed environment, version management is an ongoing, multi-faceted process. Collaborating with the team, developing a strategy, and using tools to automate the process is key to ensuring success.

Origin blog.csdn.net/yxg520s/article/details/132286710