Obtaining Kafka Metadata When Consuming Kafka Data with Flume
When using Flume to consume data from Kafka, some business scenarios require access to Kafka metadata (topic, partition, offset, and so on). This article analyzes the source code of Flume's Kafka connector, KafkaSource, to show how Flume exposes Kafka metadata while consuming Kafka data.
KafkaSource Source Code
In Flume, data is transported as Events. Event is a top-level interface in the source code; let's look at one of its implementations, SimpleEvent:
public class SimpleEvent implements Event {

  private Map<String, String> headers;
  private byte[] body;

  public SimpleEvent() {
    headers = new HashMap<String, String>();
    body = new byte[0];
  }

  @Override
  public Map<String, String> getHeaders() {
    return headers;
  }

  @Override
  public void setHeaders(Map<String, String> headers) {
    this.headers = headers;
  }

  @Override
  public byte[] getBody() {
    return body;
  }

  @Override
  public void setBody(byte[] body) {
    if (body == null) {
      body = new byte[0];
    }
    this.body = body;
  }

  @Override
  public String toString() {
    Integer bodyLen = null;
    if (body != null) bodyLen = body.length;
    return "[Event headers = " + headers + ", body.length = " + bodyLen + " ]";
  }
}
As we can see, SimpleEvent has two member variables: headers and body. headers is a HashMap, and body is a byte array. Next, let's look at the KafkaSource source code. Since it is quite long, we only excerpt the relevant fragment:
// get next message
ConsumerRecord<String, byte[]> message = it.next();
kafkaKey = message.key();

if (useAvroEventFormat) {
  //Assume the event is in Avro format using the AvroFlumeEvent schema
  //Will need to catch the exception if it is not
  ByteArrayInputStream in =
      new ByteArrayInputStream(message.value());
  decoder = DecoderFactory.get().directBinaryDecoder(in, decoder);
  if (!reader.isPresent()) {
    reader = Optional.of(
        new SpecificDatumReader<AvroFlumeEvent>(AvroFlumeEvent.class));
  }
  //This may throw an exception but it will be caught by the
  //exception handler below and logged at error
  AvroFlumeEvent avroevent = reader.get().read(null, decoder);

  eventBody = avroevent.getBody().array();
  headers = toStringMap(avroevent.getHeaders());
} else {
  eventBody = message.value();
  headers.clear();
  headers = new HashMap<String, String>(4);
}

// Add headers to event (timestamp, topic, partition, key) only if they don't exist
if (!headers.containsKey(KafkaSourceConstants.TIMESTAMP_HEADER)) {
  headers.put(KafkaSourceConstants.TIMESTAMP_HEADER,
      String.valueOf(System.currentTimeMillis()));
}
// Only set the topic header if setTopicHeader and it isn't already populated
if (setTopicHeader && !headers.containsKey(topicHeader)) {
  headers.put(topicHeader, message.topic());
}
if (!headers.containsKey(KafkaSourceConstants.PARTITION_HEADER)) {
  headers.put(KafkaSourceConstants.PARTITION_HEADER,
      String.valueOf(message.partition()));
}
if (!headers.containsKey(OFFSET_HEADER)) {
  headers.put(OFFSET_HEADER,
      String.valueOf(message.offset()));
}
if (kafkaKey != null) {
  headers.put(KafkaSourceConstants.KEY_HEADER, kafkaKey);
}

if (log.isTraceEnabled()) {
  if (LogPrivacyUtil.allowLogRawData()) {
    log.trace("Topic: {} Partition: {} Message: {}", new String[]{
        message.topic(),
        String.valueOf(message.partition()),
        new String(eventBody)
    });
  } else {
    log.trace("Topic: {} Partition: {} Message arrived.",
        message.topic(),
        String.valueOf(message.partition()));
  }
}

event = EventBuilder.withBody(eventBody, headers);
eventList.add(event);
Pay close attention to the headers.put() calls: the headers map stores four kinds of metadata — topic, partition, offset, and timestamp — plus the message key when it is present. Next, let's see how to obtain each of them.
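The header-population pattern above can be sketched as a minimal, self-contained example. The values below (topic name, partition, offset) are hypothetical stand-ins for a ConsumerRecord's metadata; only the java.util classes are real:

```java
import java.util.HashMap;
import java.util.Map;

public class KafkaHeaderSketch {
  public static void main(String[] args) {
    // Hypothetical values standing in for a ConsumerRecord's metadata
    String topic = "demo-topic";
    int partition = 2;
    long offset = 128L;

    // Populate headers the same way KafkaSource does: everything as Strings
    Map<String, String> headers = new HashMap<String, String>(4);
    headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
    headers.put("topic", topic);
    headers.put("partition", String.valueOf(partition));
    headers.put("offset", String.valueOf(offset));

    // Downstream code reads the metadata back by header name
    System.out.println(headers.get("topic"));     // demo-topic
    System.out.println(headers.get("partition")); // 2
    System.out.println(headers.get("offset"));    // 128
  }
}
```

Note that all values are stored as Strings, so numeric metadata such as partition and offset must be parsed back (e.g. with Long.parseLong) if needed as numbers.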
- Getting the topic:
if (setTopicHeader && !headers.containsKey(topicHeader)) {
  headers.put(topicHeader, message.topic());
}
When setTopicHeader is true and headers does not already contain a topicHeader entry, the topic name is stored in headers. Next, let's see how the setTopicHeader and topicHeader variables are assigned:
setTopicHeader = context.getBoolean(KafkaSourceConstants.SET_TOPIC_HEADER,
    KafkaSourceConstants.DEFAULT_SET_TOPIC_HEADER);
topicHeader = context.getString(KafkaSourceConstants.TOPIC_HEADER,
    KafkaSourceConstants.DEFAULT_TOPIC_HEADER);
As we can see, both variables fall back to default values defined in the KafkaSourceConstants class. Here is its source code:
public class KafkaSourceConstants {

  public static final String KAFKA_PREFIX = "kafka.";
  public static final String KAFKA_CONSUMER_PREFIX = KAFKA_PREFIX + "consumer.";
  public static final String DEFAULT_KEY_DESERIALIZER =
      "org.apache.kafka.common.serialization.StringDeserializer";
  public static final String DEFAULT_VALUE_DESERIALIZER =
      "org.apache.kafka.common.serialization.ByteArrayDeserializer";
  public static final String BOOTSTRAP_SERVERS =
      KAFKA_PREFIX + CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG;
  public static final String TOPICS = KAFKA_PREFIX + "topics";
  public static final String TOPICS_REGEX = TOPICS + "." + "regex";
  public static final String DEFAULT_AUTO_COMMIT = "false";
  public static final String BATCH_SIZE = "batchSize";
  public static final String BATCH_DURATION_MS = "batchDurationMillis";
  public static final int DEFAULT_BATCH_SIZE = 1000;
  public static final int DEFAULT_BATCH_DURATION = 1000;
  public static final String DEFAULT_GROUP_ID = "flume";
  public static final String MIGRATE_ZOOKEEPER_OFFSETS = "migrateZookeeperOffsets";
  public static final boolean DEFAULT_MIGRATE_ZOOKEEPER_OFFSETS = true;
  public static final String AVRO_EVENT = "useFlumeEventFormat";
  public static final boolean DEFAULT_AVRO_EVENT = false;

  /* Old Properties */
  public static final String ZOOKEEPER_CONNECT_FLUME_KEY = "zookeeperConnect";
  public static final String TOPIC = "topic";
  public static final String OLD_GROUP_ID = "groupId";

  // flume event headers
  public static final String DEFAULT_TOPIC_HEADER = "topic";
  public static final String KEY_HEADER = "key";
  public static final String TIMESTAMP_HEADER = "timestamp";
  public static final String PARTITION_HEADER = "partition";
  public static final String OFFSET_HEADER = "offset";
  public static final String SET_TOPIC_HEADER = "setTopicHeader";
  public static final boolean DEFAULT_SET_TOPIC_HEADER = true;
  public static final String TOPIC_HEADER = "topicHeader";
}
KafkaSourceConstants defines which properties of the Kafka source can be set in the configuration file. As we can see, the default value of setTopicHeader is true, and the default value of topicHeader is DEFAULT_TOPIC_HEADER = "topic". Therefore, as long as the configuration file does not set setTopicHeader=false, headers.get("topic") returns the topic of each record.
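For reference, these properties appear in the Flume agent configuration file like this (the agent name a1 and source name r1 are hypothetical; the two header-related lines simply restate the defaults and can be omitted):

```properties
a1.sources = r1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = demo-topic
# setTopicHeader defaults to true; topicHeader defaults to "topic"
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
```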
- Getting the key:
if (kafkaKey != null) {
  headers.put(KafkaSourceConstants.KEY_HEADER, kafkaKey);
}
Since KEY_HEADER is defined as public static final String KEY_HEADER = "key", calling headers.get("key") returns the message key. Note that the key header is only present when the Kafka record actually has a non-null key.
- Getting the timestamp:
if (!headers.containsKey(KafkaSourceConstants.TIMESTAMP_HEADER)) {
  headers.put(KafkaSourceConstants.TIMESTAMP_HEADER,
      String.valueOf(System.currentTimeMillis()));
}
public static final String TIMESTAMP_HEADER = "timestamp";
Calling headers.get("timestamp") returns the timestamp. Note that, as the code above shows, this value is System.currentTimeMillis() taken when Flume consumes the record — i.e. the time the event was created in Flume — not the time the record was originally written to Kafka.
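Because the header stores epoch milliseconds as a String, downstream code usually needs to parse it back and format it. A minimal sketch, using only java.time from the standard library (the header value below is a hypothetical example):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimestampHeaderDemo {
  public static void main(String[] args) {
    // A hypothetical value, as it would appear in headers.get("timestamp")
    String tsHeader = "1700000000000";

    // Parse the epoch-millis string back into a number
    long millis = Long.parseLong(tsHeader);

    // Format it as a human-readable UTC datetime
    String readable = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
        .withZone(ZoneId.of("UTC"))
        .format(Instant.ofEpochMilli(millis));
    System.out.println(readable); // 2023-11-14 22:13:20
  }
}
```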
- Getting the offset:
if (!headers.containsKey(OFFSET_HEADER)) {
  headers.put(OFFSET_HEADER,
      String.valueOf(message.offset()));
}
public static final String OFFSET_HEADER = "offset";
Calling headers.get("offset") returns the record's offset.
Demo Code for a Custom Sink
Below is a demo of a custom Flume sink that extends AbstractSink and overrides the process() method. It takes one event per transaction, reads the Kafka metadata from the headers, and commits or rolls back the transaction accordingly:
import java.util.Map;

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

public class KafkaSink extends AbstractSink {

  @Override
  public Status process() throws EventDeliveryException {
    // Get the channel and start a transaction
    Channel ch = getChannel();
    Transaction txn = ch.getTransaction();
    txn.begin();
    try {
      // Take the next event; may be null if the channel is currently empty
      Event event = ch.take();
      if (event == null) {
        // Nothing to do; commit the empty transaction and ask Flume to back off
        txn.commit();
        return Status.BACKOFF;
      }

      // Read the Kafka metadata that KafkaSource stored in the headers
      Map<String, String> headers = event.getHeaders();
      String topic = headers.get("topic");
      String key = headers.get("key");
      String timestamp = headers.get("timestamp");
      String partition = headers.get("partition");
      String offset = headers.get("offset");

      /*
       * your code ...
       */

      txn.commit();
      return Status.READY;
    } catch (Throwable t) {
      // Roll back so the event stays in the channel and can be retried
      txn.rollback();
      throw new EventDeliveryException("Failed to process event", t);
    } finally {
      txn.close();
    }
  }
}
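To wire such a custom sink into an agent, reference its fully qualified class name in the configuration file. The names below (agent a1, sink k1, channel c1, and the package com.example) are hypothetical placeholders for your own setup:

```properties
a1.sinks = k1
# Fully qualified class name of the custom sink (hypothetical package)
a1.sinks.k1.type = com.example.KafkaSink
a1.sinks.k1.channel = c1
```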