Deserialization options for consuming Kafka data with Flink


When the Flink DataStream API receives data from Kafka, the records must be deserialized before any downstream processing. Based on the author's development experience, this article briefly introduces several commonly used Kafka deserialization approaches.

Flink's top-level deserialization interfaces

1. The DeserializationSchema interface

The source is as follows:

/**
 * The deserialization schema describes how to turn the byte messages delivered by certain
 * data sources (for example Apache Kafka) into data types (Java/Scala objects) that are
 * processed by Flink.
 *
 * <p>In addition, the DeserializationSchema describes the produced type ({@link #getProducedType()}),
 * which lets Flink create internal serializers and structures to handle the type.
 *
 * <p><b>Note:</b> In most cases, one should start from {@link AbstractDeserializationSchema}, which
 * takes care of producing the return type information automatically.
 *
 * <p>A DeserializationSchema must be {@link Serializable} because its instances are often part of
 * an operator or transformation function.
 *
 * @param <T> The type created by the deserialization schema.
 */
@Public
public interface DeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {

	/**
	 * Deserializes the byte message.
	 *
	 * @param message The message, as a byte array.
	 *
	 * @return The deserialized message as an object (null if the message cannot be deserialized).
	 */
	T deserialize(byte[] message) throws IOException;

	/**
	 * Method to decide whether the element signals the end of the stream. If
	 * true is returned the element won't be emitted.
	 *
	 * @param nextElement The element to test for the end-of-stream signal.
	 * @return True, if the element signals end of stream, false otherwise.
	 */
	boolean isEndOfStream(T nextElement);
}

DeserializationSchema is the top-level interface: it defines how data from a given source is turned into a data type that Flink can process.

2. The KafkaDeserializationSchema interface

/**
 * The deserialization schema describes how to turn the Kafka ConsumerRecords
 * into data types (Java/Scala objects) that are processed by Flink.
 *
 * @param <T> The type created by the keyed deserialization schema.
 */
@PublicEvolving
public interface KafkaDeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {

	/**
	 * Method to decide whether the element signals the end of the stream. If
	 * true is returned the element won't be emitted.
	 *
	 * @param nextElement The element to test for the end-of-stream signal.
	 *
	 * @return True, if the element signals end of stream, false otherwise.
	 */
	boolean isEndOfStream(T nextElement);

	/**
	 * Deserializes the Kafka record.
	 *
	 * @param record Kafka record to be deserialized.
	 *
	 * @return The deserialized message as an object (null if the message cannot be deserialized).
	 */
	T deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception;
}

As you can see, DeserializationSchema and KafkaDeserializationSchema are sibling interfaces: both extend Serializable and ResultTypeQueryable. The difference lies in the parameter of their deserialize methods. KafkaDeserializationSchema exists specifically to deserialize Kafka data: it receives the whole ConsumerRecord, so the key, topic, partition, and offset are all accessible. DeserializationSchema can deserialize arbitrary binary data and is therefore more general.
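For orientation, both interfaces plug into the Kafka connector the same way: FlinkKafkaConsumer provides a constructor overload for each. A minimal sketch, where "demo-topic" and props are placeholder assumptions and the two schema classes are introduced in the next section:

// Value-only deserialization via a DeserializationSchema<String>
FlinkKafkaConsumer<String> valueOnly =
        new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);

// Whole-record deserialization via a KafkaDeserializationSchema<ObjectNode>
FlinkKafkaConsumer<ObjectNode> wholeRecord =
        new FlinkKafkaConsumer<>("demo-topic", new JSONKeyValueDeserializationSchema(true), props);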

Commonly used deserialization classes when consuming Kafka data

1. The SimpleStringSchema class

/**
 * Very simple serialization schema for strings.
 *
 * <p>By default, the serializer uses "UTF-8" for string/byte conversion.
 */
public class SimpleStringSchema implements DeserializationSchema<String>, SerializationSchema<String>

This class implements both the DeserializationSchema and SerializationSchema interfaces, so it can serialize as well as deserialize and can be used both when consuming from and producing to Kafka. As the Javadoc says, it is a "very simple serialization schema for strings": it serializes and deserializes data as String, with UTF-8 as the default charset.
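A typical way to wire it into a job, as a minimal sketch; the broker address, group id, and topic name are placeholder assumptions:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class SimpleStringDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "demo-group");              // placeholder group id

        // Each record's value is decoded from UTF-8 bytes into a String.
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props));

        stream.print();
        env.execute("simple-string-demo");
    }
}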

2. The JSONKeyValueDeserializationSchema class

/**
 * DeserializationSchema that deserializes a JSON String into an ObjectNode.
 *
 * <p>Key fields can be accessed by calling objectNode.get("key").get(&lt;name>).as(&lt;type>)
 *
 * <p>Value fields can be accessed by calling objectNode.get("value").get(&lt;name>).as(&lt;type>)
 *
 * <p>Metadata fields can be accessed by calling objectNode.get("metadata").get(&lt;name>).as(&lt;type>) and include
 * the "offset" (long), "topic" (String) and "partition" (int).
 */
@PublicEvolving
public class JSONKeyValueDeserializationSchema implements KafkaDeserializationSchema<ObjectNode> {

	private static final long serialVersionUID = 1509391548173891955L;

	private final boolean includeMetadata;
	private ObjectMapper mapper;

	public JSONKeyValueDeserializationSchema(boolean includeMetadata) {
		this.includeMetadata = includeMetadata;
	}

	@Override
	public ObjectNode deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
		if (mapper == null) {
			mapper = new ObjectMapper();
		}
		ObjectNode node = mapper.createObjectNode();
		if (record.key() != null) {
			node.set("key", mapper.readValue(record.key(), JsonNode.class));
		}
		if (record.value() != null) {
			node.set("value", mapper.readValue(record.value(), JsonNode.class));
		}
		if (includeMetadata) {
			node.putObject("metadata")
				.put("offset", record.offset())
				.put("topic", record.topic())
				.put("partition", record.partition());
		}
		return node;
	}

	@Override
	public boolean isEndOfStream(ObjectNode nextElement) {
		return false;
	}

	@Override
	public TypeInformation<ObjectNode> getProducedType() {
		return getForClass(ObjectNode.class);
	}
}

The JSONKeyValueDeserializationSchema constructor takes a boolean. When it is true, the deserialized data includes the Kafka metadata: offset, topic, and partition. When it is false, the result contains only the key and value of the Kafka record.

The deserialize method returns an ObjectNode. As the source below shows, an ObjectNode is backed by a Map<String, JsonNode>, so when consuming the Kafka data you can fetch what you need via objectNode.get("key"), objectNode.get("value").get(fieldName), and objectNode.get("metadata").get("offset").

public class ObjectNode extends ContainerNode<ObjectNode> {

    protected final Map<String, JsonNode> _children;

Note: this schema requires the data transported in Kafka to be JSON strings; otherwise it cannot be deserialized.
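A usage sketch; the topic name, props, and the "userId" value field are placeholder assumptions:

FlinkKafkaConsumer<ObjectNode> consumer = new FlinkKafkaConsumer<>(
        "demo-topic", new JSONKeyValueDeserializationSchema(true), props);

env.addSource(consumer)
    // Pick individual fields out of each record.
    .map(node -> {
        long offset = node.get("metadata").get("offset").asLong();
        String topic = node.get("metadata").get("topic").asText();
        String userId = node.get("value").get("userId").asText();
        return topic + "@" + offset + ": " + userId;
    })
    .print();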

3. A custom deserialization class implementing KafkaDeserializationSchema

In the author's project it was necessary to obtain the key and topic of each Kafka record, and the Kafka data also contained non-JSON payloads, so a custom KafkaDeserializationSchema implementation was needed to get at the key and topic information.

/**
 * Custom deserialization schema that extracts the key, value, and topic of each record.
 */
public class CustomKeyValueDeserializationSchema implements KafkaDeserializationSchema<String> {

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        // Records may arrive without a key; guard against a null key.
        String msKey = record.key() == null ? "" : new String(record.key(), StandardCharsets.UTF_8);
        String ms = new String(record.value(), StandardCharsets.UTF_8);
        // Join the fields with "\t" so later processing can split them easily. The result is a String.
        return ms + "\t" + msKey + "\t" + record.topic();
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false;
    }

    // Declare the produced type.
    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}

A custom class like this lets you shape the deserialized data flexibly and extract exactly the information you need.
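Wiring it up works the same way as with the built-in schemas; a minimal sketch, with the topic name and connection props again placeholder assumptions:

DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer<>("demo-topic", new CustomKeyValueDeserializationSchema(), props));

// Split the "\t"-joined string back into value, key, and topic.
stream.map(line -> {
    String[] parts = line.split("\t", -1);
    return "topic=" + parts[2] + ", key=" + parts[1] + ", value=" + parts[0];
}).print();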


Reposted from blog.csdn.net/weixin_41197407/article/details/112392393