kafka - custom serializer

Kafka provides the following built-in serializers:

  1. ByteArraySerializer
  2. StringSerializer
  3. IntegerSerializer

However, the built-in serializers cannot cover every scenario, so sometimes we need to write a custom serializer.


1. Custom serializer

1.1 The Customer class

We start by creating a simple class to represent customers:

// A simple value object representing a customer
public class Customer {
    private int customerID;
    private String customerName;
    public Customer(int ID, String name) {
        this.customerID = ID;
        this.customerName = name;
    }
    public int getID() {
        return customerID;
    }
    public String getName() {
        return customerName;
    }
}

1.2 Defining the serializer

Next create the serializer for the Customer class:

import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import java.nio.ByteBuffer;
import java.util.Map;

/**
 * Created by Joe on 2018/4/19
 */
public class CustomerSerializer implements Serializer<Customer> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no configuration needed
    }

    /*
     * A Customer object is serialized as:
     *   - a 4-byte integer holding customerID
     *   - a 4-byte integer holding the length of customerName (0 if customerName is null)
     *   - the N bytes of customerName, encoded as UTF-8
     */
    @Override
    public byte[] serialize(String topic, Customer data) {
        try {
            byte[] serializedName;
            int stringSize;
            if (data == null)
                return null;
            else {
                if (data.getName() != null) {
                    serializedName = data.getName().getBytes("UTF-8");
                    stringSize = serializedName.length;
                } else {
                    serializedName = new byte[0];
                    stringSize = 0;
                }
            }
            ByteBuffer buffer = ByteBuffer.allocate(4 + 4 + stringSize);
            buffer.putInt(data.getID());
            buffer.putInt(stringSize);
            buffer.put(serializedName);

            return buffer.array();
        } catch (Exception e) {
            throw new SerializationException("Error when serializing Customer to byte[]: " + e);
        }
    }

    @Override
    public void close() {
        // nothing to close
    }
}

After defining the CustomerSerializer class, we can send messages of type ProducerRecord<String, Customer> to Kafka.
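
For example, a producer that uses this serializer might be configured as in the following minimal sketch; the broker address localhost:9092, the topic name customers, and the sample Customer values are placeholders for this example:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CustomerProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // broker address is a placeholder; adjust for your environment
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        // use the custom serializer for the message value
        props.put("value.serializer", CustomerSerializer.class.getName());

        try (KafkaProducer<String, Customer> producer = new KafkaProducer<>(props)) {
            Customer customer = new Customer(1001, "Alice");
            ProducerRecord<String, Customer> record =
                    new ProducerRecord<>("customers", "customer-1001", customer);
            producer.send(record);
        }
    }
}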

However, a custom serializer like this has drawbacks:

  1. If the data model evolves, for example changing the customerID field to a long or adding a startDate field to Customer, old and new messages become incompatible and every consumer is affected.
  2. If multiple teams write Customer data to Kafka at the same time, they all need to use the same serializer; whenever the serializer changes, all of them have to change their code at the same time.

Therefore, in practice it is recommended not to write custom serializers, but to use an existing serialization format and its serializers and deserializers, such as JSON, Avro, Thrift, or Protobuf.

2. Avro serialization

Apache Avro is a programming language-agnostic serialization format. Avro data is defined by a language-independent schema, which is described in JSON. The data can be serialized to binary or JSON, but the binary encoding is generally used.

An interesting feature of Avro is that when the application responsible for writing messages switches to a new schema, the applications responsible for reading messages can continue to process messages without any changes. This makes Avro particularly suitable for messaging systems like Kafka.

Suppose you have the following schema:

{
    "namespace": "customerManagement.avro",
    "type": "record",
    "name": "Customer",
    "fields": [{
            "name": "id",
            "type": "int"
        },
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "faxNumber",
            "type": ["null", "string"],
            "default": "null"
        }
    ]
}

This schema indicates that the id and name fields are required, while the faxNumber field is optional and defaults to null.
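
As an illustration, here is a minimal sketch of building a record for this schema and serializing it to Avro's binary format with the Apache Avro Java library; the inlined schema string and the sample field values are assumptions for the example:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class AvroCustomerDemo {
    public static void main(String[] args) throws IOException {
        // the schema shown above, inlined as a string for the example
        String schemaJson = "{\"namespace\": \"customerManagement.avro\","
                + "\"type\": \"record\", \"name\": \"Customer\", \"fields\": ["
                + "{\"name\": \"id\", \"type\": \"int\"},"
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"faxNumber\", \"type\": [\"null\", \"string\"], \"default\": null}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // build a record that conforms to the schema
        GenericRecord customer = new GenericData.Record(schema);
        customer.put("id", 1001);
        customer.put("name", "Alice");
        customer.put("faxNumber", "555-1234");

        // serialize the record to Avro's compact binary encoding
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(customer, encoder);
        encoder.flush();

        System.out.println("Serialized " + out.size() + " bytes");
    }
}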

Suppose this schema has been in use for a while and has already produced a lot of data, and later it is decided to replace the faxNumber field with an email field. We can then create the following new schema:

{
    "namespace": "customerManagement.avro",
    "type": "record",
    "name": "Customer",
    "fields": [{
            "name": "id",
            "type": "int"
        },
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "email",
            "type": ["null", "string"],
            "default": "null"
        }
    ]
}

Before the upgrade, applications call methods such as getName(), getId(), and getFaxNumber(). If such an application encounters a message written with the new schema, getName() and getId() still work as before, while getFaxNumber() returns null because the message does not contain a fax number.

After the upgrade, getEmail() replaces getFaxNumber(). When the upgraded application encounters a message written with the old schema, getEmail() returns null because the old message does not contain an email address.

This shows the benefit of using Avro: even when the message schema changes, the applications that read the data do not all have to be updated at once; there are no exceptions or breaking errors, and the existing data does not need to be rewritten.

However, two points still need attention:

  1. The schema used to write the data and the schema used to read it must be compatible with each other; the Avro documentation describes the compatibility rules.
  2. The deserializer needs access to the schema that was used to write the data, even if it differs from the schema used to read it. Avro data files embed the writer's schema (see the sketch after this list).
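
As a minimal sketch of the second point, Avro's GenericDatumReader can be given both the writer's schema and the reader's schema; data written with the old schema is then resolved against the new one, and the missing email field falls back to its default. The helper method name below is hypothetical:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import java.io.IOException;

public class SchemaEvolutionDemo {

    /**
     * Decodes Avro binary data produced with writerSchema and exposes it
     * through readerSchema; fields missing from the data take their defaults.
     */
    public static GenericRecord readWithResolution(byte[] data,
                                                   Schema writerSchema,
                                                   Schema readerSchema) throws IOException {
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(data, null);
        GenericRecord record = reader.read(null, decoder);
        // with the schemas above, record.get("email") is null for old data
        return record;
    }
}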
