Notes on connecting Spark Streaming to Kafka

Background

The project needed Spark Streaming to consume from Kafka. I thought it would be very simple, but I ran into a surprising number of problems.

Versions

Scala 2.10, Kafka 0.11.0.0 (the kafka_2.11-0.11.0.0 distribution), JDK 1.8

POM dependencies

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    ..............

    <repositories>
        <repository>
            <id>Hortonworks</id>
            <url>http://repo.hortonworks.com/content/repositories/releases/</url>
        </repository>
    </repositories>


    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.11.0.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>


        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>

               .............

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
            <version>2.1.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.kafka</groupId>
                    <artifactId>kafka-clients</artifactId>
                </exclusion>
            </exclusions>
        </dependency>


        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.0.50.Final</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.6</version>
        </dependency>
    </dependencies>

</project>

The basic framework of the Spark program

       SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("ActionConsumer")
                .set("spark.serializer", KryoSerializer.class.getCanonicalName())
                .registerKryoClasses(new Class[]{ConsumerRecord.class})
                .set("spark.kryoserializer.buffer.max", "512m");


        // Graceful shutdown: avoids duplicate consumption and data loss in Kafka when YARN kills the job while data is being processed
        conf.set("spark.streaming.stopGracefullyOnShutdown", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));


        Set<String> topicsSet = Collections.singleton("test0");
        // Kafka-related parameters -- required! Missing them causes errors
        Map<String, Object> kafkaParams = new HashMap<String, Object>();
        kafkaParams.put("bootstrap.servers", KafkaConfig.sBOOTSTRAP_SERVER);
        kafkaParams.put("group.id", "group2");
        kafkaParams.put("key.deserializer", StringDeserializer.class.getCanonicalName());
        kafkaParams.put("value.deserializer", StringDeserializer.class.getCanonicalName());
        kafkaParams.put("enable.auto.commit", true);


        final JavaInputDStream<ConsumerRecord<Object, Object>> kafkaStream =
                KafkaUtils.createDirectStream(
                        ssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.Subscribe(topicsSet, kafkaParams)
                );


        kafkaStream.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<Object, Object>>,
                        Time>() {
            public void call(JavaRDD<ConsumerRecord<Object, Object>> consumerRecordJavaRDD,
                             Time time) throws Exception {
                if (consumerRecordJavaRDD.rdd().count() > 0) {
                    OffsetRange[] offsetRanges = ((HasOffsetRanges)consumerRecordJavaRDD
                            .rdd()).offsetRanges();

                    final long recordCount = consumerRecordJavaRDD.count();
                    List<ConsumerRecord<Object, Object>> records = consumerRecordJavaRDD.take((int) recordCount);
                    for (ConsumerRecord<Object, Object> record : records) {
                        // Get the Kafka message and handle the business logic
                        JSONObject obj = JSON.parseObject(record.value().toString());
                        System.out.println(obj);
                    }

                    ((CanCommitOffsets)kafkaStream.inputDStream()).commitAsync(offsetRanges);
                }
            }
        });


        ssc.start();
        ssc.awaitTermination();
        ssc.close();

Problems encountered

1) The class org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveSparkAppConfig$ cannot be found

Solution: change the Spark dependencies in the POM to the Scala 2.10 (_2.10) artifacts shown above.


2) Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Solution: I had been connecting to a remote Spark master, i.e. passing a remote address to SparkConf.setMaster(). I changed the parameter to local[2] instead; the jar will be run on the server anyway, which is no different from running locally.
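
For reference, a minimal sketch of the change (the spark:// address below is just a placeholder for whatever remote master was being used):

        // Before: pointing the driver at a remote master (placeholder address)
        // SparkConf conf = new SparkConf().setMaster("spark://remote-host:7077").setAppName("ActionConsumer");

        // After: run with two local threads; the packaged jar runs on the server anyway
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("ActionConsumer");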


3) Spark cannot find a method: java.lang.NoSuchMethodError: javax.servlet.http.HttpServletResponse.getStatus()I

Solution: after adjusting the POM dependencies as described above, the problem went away on its own; there was no need to add a separate dependency for javax.servlet.http.HttpServletRequest.


4) requirement failed: No output operations registered, so nothing to execute

Solution: the stream needs at least one output operation, such as print() or the foreachRDD call shown above.
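
For illustration, a minimal sketch assuming the kafkaStream from the framework above: registering any output operation, even a simple print(), satisfies the requirement. Mapping to the record value first avoids collecting the non-serializable ConsumerRecord itself (see problem 9):

        // Requires org.apache.spark.api.java.function.Function.
        // Without at least one output operation, the context throws
        // "No output operations registered" when it starts.
        kafkaStream.map(new Function<ConsumerRecord<Object, Object>, Object>() {
            public Object call(ConsumerRecord<Object, Object> record) {
                return record.value();   // print the value, not the ConsumerRecord itself
            }
        }).print();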


5) JsonMappingException: Incompatible Jackson version: 2.7.8

Solution: a Jackson dependency conflict; copy the relevant parts of the POM above (exclude Jackson from kafka-clients and pin jackson-databind 2.6.6).


6) Failed to send RPC: java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touc...

Solution: configure things as I did above: the POM and the master setting.


7) NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllo...

Solution: a Netty dependency conflict; copy the relevant part of the POM above (the netty-all 4.0.50.Final dependency).


8) ClassNotFoundException: org.apache.spark.streaming.kafka010.KafkaRDDPartition

Solution: downgrade the Scala version to 2.10 (I had been using 2.11).


9) object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord)

This error is reported when outputting RDD contents.

Solution: register the Kryo serializer on SparkConf, as at the beginning of the program framework above:

       SparkConf conf = new SparkConf()
                ...
                .set("spark.serializer", KryoSerializer.class.getCanonicalName())
                .registerKryoClasses(new Class[]{ConsumerRecord.class})
                ...


10) Error: ...To avoid this, increase spark.kryoserializer.buffer.max value...

Solution: the Kryo buffer is too small; as at the beginning of the program framework above, set Kryo's maximum buffer size on SparkConf:

       SparkConf conf = new SparkConf()
                ...
                .set("spark.kryoserializer.buffer.max", "512m");


11) NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;

Solution: downgrade the Scala version to 2.10 (I had been using 2.11).


12) Spark Streaming's map method is never executed; even when there is data, a breakpoint inside it is never hit

Solution: as in the program framework above, switch to foreachRDD to traverse the RDD, get the number of records from the RDD's count() method, and take() that many records; see the sketch below.
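
A condensed sketch of that pattern, assuming the kafkaStream and types from the framework above (map() is a lazy transformation, so on its own nothing drives it; foreachRDD is an output operation, so the work inside it actually runs):

        // kafkaStream.map(...)  -- lazy: never executed on its own, breakpoints are never hit
        kafkaStream.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<Object, Object>>, Time>() {
            public void call(JavaRDD<ConsumerRecord<Object, Object>> rdd, Time time) {
                long recordCount = rdd.count();                     // number of records in this batch
                for (ConsumerRecord<Object, Object> record : rdd.take((int) recordCount)) {
                    System.out.println(record.value());             // business logic goes here
                }
            }
        });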


13) Spark Streaming loses data

Solution: offsets were not committed in time; commit them as in this part of the code above:

        kafkaStream.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<Object, Object>>,
                        Time>() {
            public void call(JavaRDD<ConsumerRecord<Object, Object>> consumerRecordJavaRDD,
                             Time time) throws Exception {
                if (consumerRecordJavaRDD.rdd().count() > 0) {
                    OffsetRange[] offsetRanges = ((HasOffsetRanges)consumerRecordJavaRDD
                            .rdd()).offsetRanges();

                                       ......

                    ((CanCommitOffsets)kafkaStream.inputDStream()).commitAsync(offsetRanges);
                }
            }
        });

At the same time, turn on auto commit:

kafkaParams.put("enable.auto.commit", true);

After many tests: when Kafka receives data while the streaming job is stopped, the job still picks up all of the new data since the last run once it starts again; in other words, the two settings above do not conflict.

Conclusion

2020-07-22 note: after compiling and running with Scala 2.10, I changed the Scala version to 2.11 and compiled and ran again; it succeeded without errors. Amazing.


Origin blog.csdn.net/qq_37475168/article/details/107757003