Super detailed! An article explaining in detail how Spark Streaming integrates with Kafka, with code you can practice with

Source | Alice

Editor-in-Chief | Carol

Cover image | CSDN, downloaded from Visual China

Produced by | CSDN (ID: CSDNnews)

I believe many of you have already worked with Spark Streaming, so I won't dwell on the theory. Today's article is a hands-on tutorial on integrating Spark Streaming with Kafka.

The code is included in the text, and interested friends can copy it and try it out!

Kafka review

Before we officially start, let's review Kafka.

  • Core concepts

Broker: a machine running the Kafka service is a broker.

Producer: the producer of messages, responsible for writing (pushing) data to the broker.

Consumer: the consumer of messages, responsible for pulling data from Kafka. The old consumer depends on ZooKeeper; the new one does not.

Topic: a topic is a logical category of data; different topics store data for different business lines. In short, a topic distinguishes business domains.

Replication: how many copies of the data are kept, to ensure data is not lost. In short, replication provides data safety.

Partition: a physical partition; each partition is stored as a separate log on disk. A topic can have 1 to n partitions, and each partition has its own replicas. In short, partitions enable concurrent reads and writes.

Consumer Group: a topic can be consumed by multiple consumers or groups at the same time, but consumers within the same group do not consume the same data repeatedly. In short, consumer groups increase consumption throughput and make consumers easier to manage as a unit.

Note [1]: a topic can be subscribed to by multiple consumers or groups, and a consumer or group can also subscribe to multiple topics.

Note [2]: data is read from and written to the leader only; followers synchronize data from the leader to maintain the replicas.
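To make the producer role concrete, here is a minimal sketch of a standalone Scala producer using the plain kafka-clients API. It is not part of the original article; the broker address and the spark_kafka topic are taken from the commands and demos used later, and the kafka-clients dependency is assumed to be on the classpath.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducer {
  def main(args: Array[String]): Unit = {
    // Minimal producer configuration (broker address assumed from the cluster used in this article)
    val props = new Properties()
    props.put("bootstrap.servers", "node01:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Push a few messages to the spark_kafka topic used in the demos below
    (1 to 5).foreach { i =>
      producer.send(new ProducerRecord[String, String]("spark_kafka", s"key-$i", s"hello kafka $i"))
    }
    producer.close()
  }
}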

  • Common commands

Start Kafka

/export/servers/kafka/bin/kafka-server-start.sh -daemon /export/servers/kafka/config/server.properties

Stop Kafka

/export/servers/kafka/bin/kafka-server-stop.sh

View topic information

/export/servers/kafka/bin/kafka-topics.sh --list --zookeeper node01:2181

Create topic

/export/servers/kafka/bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 3 --partitions 3 --topic test

View information on a topic

/export/servers/kafka/bin/kafka-topics.sh --describe --zookeeper node01:2181 --topic test

Delete topic

/export/servers/kafka/bin/kafka-topics.sh --zookeeper node01:2181 --delete --topic test

Start a console producer (generally used for testing)

/export/servers/kafka/bin/kafka-console-producer.sh --broker-list node01:9092 --topic spark_kafka

Start a console consumer (generally used for testing), connecting through ZooKeeper

/export/servers/kafka/bin/kafka-console-consumer.sh --zookeeper node01:2181 --topic spark_kafka --from-beginning

Start a console consumer, connecting directly to the broker addresses

/export/servers/kafka/bin/kafka-console-consumer.sh --bootstrap-server node01:9092,node02:9092,node03:9092 --topic spark_kafka --from-beginning 



The two modes of integrating Kafka

This is also a hot interview topic.

In development, we often use Spark Streaming to read data from Kafka in real time and then process it. Since Spark 1.3, KafkaUtils has provided two ways to create a DStream:

1. Receiver mode:

  • KafkaUtils.createDstream (not used in development, just understand it; interviewers may ask about it).

  • The Receiver runs as a long-lived Task on an Executor, waiting for data. A single Receiver is inefficient, so you need to start several of them and then manually merge (union) their streams before processing, which is cumbersome.

  • If the machine running a Receiver goes down, data may be lost, so the WAL (write-ahead log) must be enabled to keep the data safe, which in turn reduces efficiency!

  • Receiver mode connects to Kafka through ZooKeeper and calls Kafka's high-level API; the offset is stored in ZooKeeper and maintained by the Receiver.

  • To avoid losing data, Spark also saves an offset in its checkpoint during consumption, so the two offsets can become inconsistent.

  • So from whatever angle you look at it, Receiver mode is not suitable for development and has been phased out.

2. Direct mode

  • KafkaUtils.createDirectStream (used in development; you need to master it).

  • Direct mode connects directly to the Kafka partitions to fetch data; reading from each partition directly greatly improves parallelism.

  • It calls Kafka's low-level API directly; by default the offset is stored and maintained by Spark in its checkpoint, which eliminates the inconsistency with ZooKeeper.

  • Of course, you can also maintain the offset manually and store it in MySQL or Redis (a sketch of this pattern follows this list).

  • Therefore Direct mode is the one used in development, and combining its characteristics with manual offset management can guarantee that data is processed exactly once.
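To make "Direct mode + manual offset management" concrete, here is a conceptual sketch (not from the original article). It uses the 0.10 API covered at the end of this article; ssc, topics and kafkaParams are assumed to be set up as in that demo, and loadOffsets / saveOffsets are hypothetical helpers backed by MySQL or Redis.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}

// Read the offsets saved by the previous run from the external store (hypothetical helper)
val fromOffsets: Map[TopicPartition, Long] = loadOffsets()
val stream = KafkaUtils.createDirectStream[String, String](ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, fromOffsets))
stream.foreachRDD { rdd =>
  // The offset range of every Kafka partition in this batch
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch and write the results ...
  // Save the new offsets together with the results (ideally in one transaction) for exactly-once
  saveOffsets(ranges)   // hypothetical helper
}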

To sum up:

  • Receiver mode

  1. Multiple Receivers can receive data efficiently, but there is a risk of losing data.

  2. Enabling the write-ahead log (WAL) prevents data loss, but writing the data twice is inefficient.

  3. ZooKeeper maintains the offset, so data may be consumed repeatedly.

  4. Uses the high-level API.

  • Direct mode

  1. Reads data directly from the Kafka partitions, without a Receiver.

  2. Does not use the write-ahead log (WAL).

  3. Spark maintains the offset itself.

  4. Uses the low-level API.

Extension: regarding message semantics

There are two versions of the Spark Streaming + Kafka integration used in development: 0.8 and 0.10+.

The 0.8 version has both Receiver and Direct modes (but 0.8 has more problems in production environments, and it is no longer supported after Spark 2.3).

Since 0.10, only Direct mode has been kept (Receiver mode is not suitable for production), and the 0.10 API has changed and become more powerful.

In conclusion:

For study and development we use Direct mode with the 0.10 version directly, but in an interview you should be able to explain the difference between Receiver and Direct.

spark-streaming-kafka-0-8 (for understanding only)

1. Receiver

KafkaUtils.createDstream uses a Receiver to receive data, calling Kafka's high-level consumer API; the offset is maintained in ZooKeeper by the Receiver. The data received by all Receivers is stored in the Spark executors, and Spark Streaming then starts jobs to process it, so by default data may be lost. The WAL can be enabled, which synchronously saves the received data to a distributed file system such as HDFS so that it can be recovered after a failure. Although this approach, combined with the WAL, achieves high reliability with zero data loss, the WAL lowers efficiency, and it still cannot guarantee that data is processed exactly once: it may be processed twice, because Spark and ZooKeeper can get out of sync.

(The official documentation no longer recommends this kind of integration.)

  • Preparation

1) Start the ZooKeeper cluster

zkServer.sh start

2) Start the Kafka cluster

kafka-server-start.sh  /export/servers/kafka/config/server.properties

3) Create a topic

kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 1 --partitions 3 --topic spark_kafka

4) Send messages to the topic from the shell

kafka-console-producer.sh --broker-list node01:9092 --topic  spark_kafka

5) Add the Kafka pom dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
  • API

Get the topic data from Kafka through a receiver; you can run multiple receivers to read the Kafka topic in parallel. Here we use 3:

 val receiverDStream: immutable.IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(x => {
      val stream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
      stream
    })

If the WAL is enabled (spark.streaming.receiver.writeAheadLog.enable=true), a storage level can also be set (the default is StorageLevel.MEMORY_AND_DISK_SER_2).
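For reference, a minimal sketch (not from the original tutorial) of passing the storage level explicitly through the createStream overload that accepts one; ssc, zkQuorum, groupId and topics are the same variables as in the demo below:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Same receiver as above, but with the storage level passed explicitly
val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics,
  StorageLevel.MEMORY_AND_DISK_SER_2)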

Code demo

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.immutable

object SparkKafka {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    val config: SparkConf = 
new SparkConf().setAppName("SparkStream").setMaster("local[*]")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
//Enable the WAL (write-ahead log) to ensure reliability on the data source side
    val sc = new SparkContext(config)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc,Seconds(5))
    ssc.checkpoint("./kafka")
//==============================================
    //2. Prepare configuration parameters
    val zkQuorum = "node01:2181,node02:2181,node03:2181"
    val groupId = "spark"
    val topics = Map("spark_kafka" -> 2)//2表示每一个topic对应分区都采用2个线程去消费,
//ssc的rdd分区和kafka的topic分区不一样,增加消费线程数,并不增加spark的并行处理数据数量
    //3. Get the Kafka topic data through receivers; more receivers can run in parallel to read the topic data. Here we use 3
    val receiverDStream: immutable.IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(x => {
      val stream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
      stream
    })
    //4. Use union to merge the DStreams produced by all the receivers
    val allDStream: DStream[(String, String)] = ssc.union(receiverDStream)
    //5. Get the data as (String, String) pairs; the first String is the message key and the second is the message value
    val data: DStream[String] = allDStream.map(_._2)
//==============================================
    //6.WordCount
    val words: DStream[String] = data.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

2. Direct

Direct mode periodically queries the latest offset of each partition of the Kafka topic, and then processes the data of each batch according to that offset range.

  • The disadvantage of Direct mode is that it cannot be used with ZooKeeper-based Kafka monitoring tools.

  • Direct has several advantages over Receiver:

  1. Simplified parallelism

    There is no need to create multiple Kafka input streams and then union them. Spark Streaming creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so Spark's RDD partitions and Kafka's topic partitions map one-to-one.

  2. Efficiency

    Receiver mode achieves zero data loss by saving the data to the WAL first, so the data is copied twice: once by Kafka and once more when it is written to the WAL. Direct mode does not use a WAL and eliminates this problem.

  3. Exactly-once semantics

    Receiver mode reads Kafka data through the high-level API and writes the offset to ZooKeeper. Although saving the data in the WAL ensures it is not lost, the offsets stored by Spark Streaming and by ZooKeeper can become inconsistent, causing data to be consumed more than once.

    Direct mode achieves exactly-once semantics (EOS) by using the low-level Kafka API; the offset is saved only by the StreamingContext in its checkpoint, eliminating the inconsistency between the ZooKeeper and Spark offsets.

  • API

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

Code demo

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}


object SparkKafka2 {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    val config: SparkConf = 
new SparkConf().setAppName("SparkStream").setMaster("local[*]")
    val sc = new SparkContext(config)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc,Seconds(5))
    ssc.checkpoint("./kafka")
    //==============================================
    //2. Prepare configuration parameters
    val kafkaParams = Map("metadata.broker.list" -> "node01:9092,node02:9092,node03:9092", "group.id" -> "spark")
    val topics = Set("spark_kafka")
    val allDStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    //3. Get the topic data (the message value)
    val data: DStream[String] = allDStream.map(_._2)
    //==============================================
    //WordCount
    val words: DStream[String] = data.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
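As a small extension (not in the original demo), you can inspect which Kafka partitions and offset ranges each batch covers, which also makes the one-to-one mapping between RDD partitions and Kafka partitions visible. A sketch using HasOffsetRanges from the 0-8 package, applied to the allDStream defined above (before ssc.start() is called):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Capture the offset ranges of each batch before any transformation that shuffles the data
var offsetRanges = Array.empty[OffsetRange]
allDStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition: topic, partition, fromOffset, untilOffset
  offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}"))
}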


spark-streaming-kafka-0-10

  • Explanation

In spark-streaming-kafka-0-10 the API has changed somewhat; it is more flexible to use, and it is what we use in development.

  • pom.xml

<!--<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>-->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
  • API:

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

  • Create topic

/export/servers/kafka/bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 3 --partitions 3 --topic spark_kafka

  • Start producer

/export/servers/kafka/bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic spark_kafka

  • Code demo

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkKafkaDemo {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    //spark.master should be set as local[n], n > 1
    val conf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc,Seconds(5))//5 means the data is sliced into one RDD every 5 seconds (the batch interval)
    //Prepare the parameters for connecting to Kafka
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node01:9092,node02:9092,node03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "SparkKafkaDemo",
      //earliest: if a partition has a committed offset, consume from it; if not, consume from the beginning
      //latest: if a partition has a committed offset, consume from it; if not, consume only the data newly produced to that partition
      //none: if every partition has a committed offset, consume from those offsets; if any partition has no committed offset, throw an exception
      //Here we use latest: if there is an offset, start consuming from it; if not, start from the newly arriving data
      "auto.offset.reset" -> "latest",
      //false disables auto-commit; offsets are committed by Spark to the checkpoint or maintained manually by the programmer
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("spark_kafka")
    //2. Use KafkaUtils to connect to Kafka and get the data
    val recordDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,//Location strategy; strongly recommended in the source code, it distributes the Kafka partitions evenly across the Spark executors
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))//Consumer strategy; strongly recommended in the source code
    //3. Get the VALUE data
    val lineDStream: DStream[String] = recordDStream.map(_.value())//_ refers to the ConsumerRecord
    val wordDStream: DStream[String] = lineDStream.flatMap(_.split(" ")) //_ refers to the incoming value, i.e. one line of data
    val wordAndOneDStream: DStream[(String, Int)] = wordDStream.map((_,1))
    val result: DStream[(String, Int)] = wordAndOneDStream.reduceByKey(_+_)
    result.print()
    ssc.start()//Start the computation
    ssc.awaitTermination()//Wait for graceful termination
  }
}
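Since enable.auto.commit is set to false above, the offsets are either stored in Spark's checkpoint or committed by hand. Here is a minimal sketch (not part of the original demo) of committing each batch's offsets back to Kafka with the 0-10 API, reusing the recordDStream from the code above and placed before ssc.start():

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit each batch's offsets back to Kafka once the batch has been processed
recordDStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here (e.g. the WordCount logic above) ...
  // commitAsync stores the offsets in Kafka itself for this group.id
  recordDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

Alternatively, the offsetRanges can be written to an external store such as MySQL or Redis, as sketched earlier, if you want full control for exactly-once processing.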

OK, this article has walked through the process of integrating Spark Streaming with Kafka and reviewed the basics of Kafka along the way. If it was useful to you, please give it a "Watching"~

This article was first published on the author's CSDN blog; original link:

https://blog.csdn.net/weixin_44318830/article/details/105612516

【END】
