Background
The project needed to use Spark Streaming to consume data from Kafka. I expected it to be simple, but I ran into a surprising number of problems along the way.
version
scala version 2.10, kafka version 2.11.0-0.11.0.0, jdk1.8
pom dependency
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    ..............
    <repositories>
        <repository>
            <id>Hortonworks</id>
            <url>http://repo.hortonworks.com/content/repositories/releases/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.11.0.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>
        .............
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
            <version>2.1.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.kafka</groupId>
                    <artifactId>kafka-clients</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.0.50.Final</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.6</version>
        </dependency>
    </dependencies>
</project>
The basic framework of the Spark program
SparkConf conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("ActionConsumer")
        .set("spark.serializer", KryoSerializer.class.getCanonicalName())
        .registerKryoClasses(new Class[]{ConsumerRecord.class})
        .set("spark.kryoserializer.buffer.max", "512m");
// Graceful shutdown: avoids duplicate consumption and data loss in Kafka
// when YARN kills the job while a batch is being processed
conf.set("spark.streaming.stopGracefullyOnShutdown", "true");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));

Set<String> topicsSet = Collections.singleton("test0");

// Kafka parameters -- required; leaving them out causes errors
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", KafkaConfig.sBOOTSTRAP_SERVER);
kafkaParams.put("group.id", "group2");
kafkaParams.put("key.deserializer", StringDeserializer.class.getCanonicalName());
kafkaParams.put("value.deserializer", StringDeserializer.class.getCanonicalName());
kafkaParams.put("enable.auto.commit", true);

final JavaInputDStream<ConsumerRecord<Object, Object>> kafkaStream =
        KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topicsSet, kafkaParams)
        );

kafkaStream.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<Object, Object>>, Time>() {
    public void call(JavaRDD<ConsumerRecord<Object, Object>> consumerRecordJavaRDD,
                     Time time) throws Exception {
        if (consumerRecordJavaRDD.rdd().count() > 0) {
            OffsetRange[] offsetRanges =
                    ((HasOffsetRanges) consumerRecordJavaRDD.rdd()).offsetRanges();
            final long recordCount = consumerRecordJavaRDD.count();
            List<ConsumerRecord<Object, Object>> records =
                    consumerRecordJavaRDD.take((int) recordCount);
            for (ConsumerRecord<Object, Object> record : records) {
                // Get the Kafka message and run the business logic
                JSONObject obj = JSON.parseObject(record.value().toString());
                System.out.println(obj);
            }
            // Commit offsets only after the batch has been processed
            ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
        }
    }
});

ssc.start();
ssc.awaitTermination();
ssc.close();
Problems encountered
1) The class org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveSparkAppConfig$ cannot be found
Solution: change the Spark dependencies in the pom to the Scala 2.10 builds (the _2.10 artifact suffix).
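Concretely, this means every Spark artifact in the pom must carry the same Scala suffix; the relevant entries from the pom above look like this (versions are the ones this project uses -- adjust to your own):

```xml
<!-- All Spark artifacts must share the same Scala suffix (_2.10 here) -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
    <version>2.1.0</version>
</dependency>
```

Mixing, say, spark-streaming_2.10 with spark-streaming-kafka-0-10_2.11 causes exactly this kind of missing-class error, because the two artifacts are compiled against incompatible Scala runtimes.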
2) Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Solution: the program used to connect to a remote Spark master, i.e. a remote address passed to SparkConf.setMaster(). I changed the parameter to local[2]; since the jar will be run on the server anyway, this is no different from running locally.
3) java.lang.NoSuchMethodError: javax.servlet.http.HttpServletResponse.getStatus()I
Solution: after fixing the pom dependencies as described above, this problem disappeared on its own; there is no need to pull in an extra javax.servlet dependency.
4) requirement failed: No output operations registered, so nothing to execute
Solution: the stream needs at least one output operation, such as print() or the foreachRDD used above.
5) JsonMappingException: Incompatible Jackson version: 2.7.8
Solution: Jackson dependency conflict; copy the Jackson-related parts of the pom dependencies above.
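For reference, the Jackson-related parts of the pom are the wildcard exclusion on kafka-clients plus an explicit jackson-databind pinned to a version compatible with this Spark build (2.6.6 here):

```xml
<!-- Exclude the Jackson that kafka-clients pulls in transitively -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.0</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Pin jackson-databind to the version Spark expects -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.6.6</version>
</dependency>
```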
6) Failed to send RPC: java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touc
Solution: configure the pom dependencies and the master exactly as shown above.
7) NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllo
Solution: Netty IO dependency conflict; copy the netty-all dependency from the pom above.
8) ClassNotFoundException: org.apache.spark.streaming.kafka010.KafkaRDDPartition
Solution: downgrade the Scala version to 2.10 (I had been using 2.11).
9) object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord)
This error appears when outputting the contents of the RDD.
Solution: set the Kryo serializer on the SparkConf, as at the top of the program framework above:
SparkConf conf = new SparkConf()
        ...
        .set("spark.serializer", KryoSerializer.class.getCanonicalName())
        .registerKryoClasses(new Class[]{ConsumerRecord.class})
        ...
10) Error: ...To avoid this, increase spark.kryoserializer.buffer.max value...
Solution: the Kryo buffer is too small; set its maximum size on the SparkConf, as at the top of the program framework above:
SparkConf conf = new SparkConf()
        ......
        .set("spark.kryoserializer.buffer.max", "512m");
11) NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
Solution: downgrade the Scala version to 2.10 (I had been using 2.11).
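If the pom declares scala-library directly (an assumption -- this dependency is not shown in the pom excerpt above), its version must also match the _2.10 suffix of the Spark artifacts, e.g.:

```xml
<!-- Hypothetical entry: the 2.10.x line must match the _2.10 artifact suffix -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
</dependency>
```

Errors like this HashSet$.empty() one are the classic symptom of Scala 2.10 and 2.11 classes ending up on the same classpath.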
12) Spark Streaming's map method is never executed; even when data is arriving, a breakpoint inside it is never hit.
Solution: as in the program framework above, switch to foreachRDD to traverse the RDD, read the number of records with the RDD's count() method, and take() that many records.
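The underlying reason is laziness: Spark transformations such as map only describe work, and nothing runs until an output operation (an action) like foreachRDD forces it. Plain java.util.stream pipelines behave the same way, which makes the effect easy to see outside Spark; the following is an analogy in plain Java, not Spark code:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // Counts how many times the mapper function has actually run
    public static final AtomicInteger calls = new AtomicInteger();

    public static Stream<Integer> buildPipeline() {
        // map() is an intermediate (lazy) operation: building the
        // pipeline does not execute the lambda at all
        return Stream.of(1, 2, 3).map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> s = buildPipeline();
        System.out.println("after map: " + calls.get() + " calls");      // still 0
        List<Integer> out = s.collect(Collectors.toList());              // terminal op triggers execution
        System.out.println("after collect: " + calls.get() + " calls");  // now 3
        System.out.println(out);
    }
}
```

In the same way, a DStream map with no registered output operation is never scheduled, so a breakpoint inside it is never reached.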
13) Spark Streaming loses data
Solution: offsets were not committed in time; commit them as in this part of the code above:
kafkaStream.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<Object, Object>>, Time>() {
    public void call(JavaRDD<ConsumerRecord<Object, Object>> consumerRecordJavaRDD,
                     Time time) throws Exception {
        if (consumerRecordJavaRDD.rdd().count() > 0) {
            OffsetRange[] offsetRanges =
                    ((HasOffsetRanges) consumerRecordJavaRDD.rdd()).offsetRanges();
            ......
            ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
        }
    }
});
At the same time, enable auto-commit:
kafkaParams.put("enable.auto.commit", true);
After repeated tests: if Kafka receives data while the streaming job is stopped, the job still picks up all the new data since its last run once it restarts. In other words, the two settings do not conflict.
Conclusion
2020-07-22 update: after compiling and running with Scala 2.10, I switched the Scala version back to 2.11, compiled and ran again, and it worked without errors. Strange, but true.