Reading the topic and partition of Kafka records in Spark Streaming
- 0.10.0 or higher
Add the Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
The 0.10 integration makes it easy to read each record's metadata, so a single createDirectStream can consume multiple topics. The code below is adapted from the official documentation.
public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("test");
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "fp-bd5:6667");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
    kafkaParams.put("auto.offset.reset", "latest");
    kafkaParams.put("enable.auto.commit", true);

    HashSet<String> topicSet = new HashSet<>();
    topicSet.add("test2");
    topicSet.add("test1");

    JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(topicSet, kafkaParams)
            );

    // Each ConsumerRecord carries its metadata along with the key and value,
    // so records can be routed to different business logic by topic.
    stream.foreachRDD(rdd -> rdd.foreach(record ->
            System.out.println("topic:" + record.topic()
                    + ", partition:" + record.partition()
                    + ", value:" + record.value())
    ));

    jssc.start();
    jssc.awaitTermination();
}
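The per-topic branching mentioned in the comment can be sketched in plain Java, without a Spark or Kafka dependency: register one handler per topic and dispatch each record by its topic name. `Record` and `dispatch` here are hypothetical stand-ins for Kafka's `ConsumerRecord` and the body of the `foreachRDD` lambda.

```java
import java.util.*;
import java.util.function.Function;

public class TopicDispatch {

    // Hypothetical stand-in for Kafka's ConsumerRecord.
    static class Record {
        final String topic;
        final int partition;
        final String value;

        Record(String topic, int partition, String value) {
            this.topic = topic;
            this.partition = partition;
            this.value = value;
        }
    }

    // Apply the handler registered for each record's topic; topics without
    // a registered handler fall through to a default.
    static List<String> dispatch(List<Record> batch,
                                 Map<String, Function<Record, String>> handlers) {
        List<String> results = new ArrayList<>();
        for (Record r : batch) {
            Function<Record, String> h =
                    handlers.getOrDefault(r.topic, x -> "unhandled topic: " + x.topic);
            results.add(h.apply(r));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Function<Record, String>> handlers = new HashMap<>();
        handlers.put("test1", r -> "test1 logic: " + r.value);
        handlers.put("test2", r -> "test2 logic: " + r.value);

        List<Record> batch = Arrays.asList(
                new Record("test1", 0, "a"),
                new Record("test2", 1, "b"),
                new Record("test3", 0, "c"));

        dispatch(batch, handlers).forEach(System.out::println);
    }
}
```

Keeping the handlers in a map rather than an if/else chain makes it easy to add a topic without touching the dispatch loop.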
- 0.8.2.1
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
The older integration only exposes offset information per batch, so its main use is persisting each partition's consumed offsets. To consume multiple topics, you typically create multiple createDirectStream instances, or encode the topic in the message key.
AtomicReference<OffsetRange[]> offsetRanges = new AtomicReference<>();

JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicSet
);

directStream.transformToPair(rdd -> {
    // Capture the offset ranges before any other transformation.
    OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    offsetRanges.set(offsets);
    return rdd;
}).map(r -> r) // placeholder for real transformations
  .foreachRDD(rdd -> {
    // Each batch is split into one OffsetRange per topic and partition.
    for (OffsetRange o : offsetRanges.get()) {
        System.out.println(o.topic() + " " + o.partition() + " "
                + o.fromOffset() + " " + o.untilOffset());
    }
    rdd.foreach(r -> System.out.println(r.toString()));
});
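The offset-saving step itself can be sketched in plain Java: collapse the per-partition ranges printed above into a map keyed by topic and partition, holding the offset to resume from after a restart. `Range` and `toCheckpoint` are hypothetical stand-ins for `OffsetRange` and whatever external store (ZooKeeper, a database, etc.) you actually write to.

```java
import java.util.*;

public class OffsetStore {

    // Hypothetical stand-in for Spark's OffsetRange.
    static class Range {
        final String topic;
        final int partition;
        final long fromOffset;
        final long untilOffset;

        Range(String topic, int partition, long fromOffset, long untilOffset) {
            this.topic = topic;
            this.partition = partition;
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }
    }

    // Key each partition as "topic-partition" and keep the highest untilOffset,
    // i.e. the next offset to consume after a restart.
    static Map<String, Long> toCheckpoint(List<Range> ranges) {
        Map<String, Long> checkpoint = new HashMap<>();
        for (Range r : ranges) {
            checkpoint.merge(r.topic + "-" + r.partition, r.untilOffset, Math::max);
        }
        return checkpoint;
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
                new Range("test1", 0, 0, 120),
                new Range("test1", 1, 0, 95),
                new Range("test2", 0, 10, 42));
        System.out.println(toCheckpoint(ranges));
    }
}
```

On restart, this map would feed the `fromOffsets` argument of the direct stream so consumption resumes exactly where it stopped.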