Source | Alice
Editor-in-Chief | Carol
Cover image | CSDN, downloaded from Visual China
Produced by | CSDN (ID: CSDNnews)
Many readers have probably already worked with SparkStreaming, so I won't dwell on the theory. Today's article is a tutorial on integrating SparkStreaming with Kafka.
The code is included in the text, so interested readers can copy it and try it out!
Kafka review
Before the official start, let us review Kafka.
Illustration of core concepts
Broker: a machine on which the Kafka service is installed is a broker.
Producer: the producer of messages, responsible for writing data to the brokers (push).
Consumer: the consumer of messages, responsible for pulling data from Kafka (pull). Older consumer versions depend on ZooKeeper (zk); newer versions do not.
Topic: a topic is a category of data; different topics store data for different businesses. In short, topics separate businesses.
Replication: the number of copies kept of the data, so that data is not lost. In short, replicas provide data safety.
Partition: a physical partition, stored on disk as log files. A topic can have 1 to n partitions, and each partition has its own replicas. In short, partitions allow concurrent reads and writes.
Consumer Group: a topic can be consumed by multiple consumers or consumer groups at the same time. Consumers within the same group do not consume the same message twice: each message goes to only one consumer in the group. In short, consumer groups increase consumption speed and make unified management easier.
Note [1]: A topic can be subscribed to by multiple consumers or groups, and a consumer or group can also subscribe to multiple topics.
Note [2]: Data can only be read from the leader and can only be written to the leader. Followers synchronize data from the leader to maintain replicas.
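The consumer group rule above (each partition is read by only one consumer within a group) can be sketched in plain Scala. This is a simplified illustration only: the round-robin `assign` function and the consumer names are hypothetical and are not Kafka's actual assignment strategy.

```scala
object ConsumerGroupSketch {
  // Hypothetical round-robin assignment: every partition gets exactly one
  // owner inside the group, so no message is consumed twice by the group.
  def assign(partitions: Seq[Int], consumers: Seq[String]): Map[String, Seq[Int]] = {
    require(consumers.nonEmpty, "a group needs at least one consumer")
    val empty: Map[String, Seq[Int]] = consumers.map(c => c -> Seq.empty[Int]).toMap
    partitions.zipWithIndex.foldLeft(empty) { case (acc, (partition, index)) =>
      val owner = consumers(index % consumers.size)
      acc.updated(owner, acc(owner) :+ partition)
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(0, 1, 2)
    // Two consumers: c1 owns partitions 0 and 2, c2 owns partition 1.
    println(assign(partitions, Seq("c1", "c2")))
    // Four consumers for three partitions: c4 owns nothing and stays idle.
    println(assign(partitions, Seq("c1", "c2", "c3", "c4")))
  }
}
```

The second call shows why adding more consumers than partitions to one group does not speed up consumption any further.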
Common commands
Start kafka
/export/servers/kafka/bin/kafka-server-start.sh -daemon /export/servers/kafka/config/server.properties
Stop kafka
/export/servers/kafka/bin/kafka-server-stop.sh
List all topics
/export/servers/kafka/bin/kafka-topics.sh --list --zookeeper node01:2181
Create topic
/export/servers/kafka/bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 3 --partitions 3 --topic test
View information on a topic
/export/servers/kafka/bin/kafka-topics.sh --describe --zookeeper node01:2181 --topic test
Delete topic
/export/servers/kafka/bin/kafka-topics.sh --zookeeper node01:2181 --delete --topic test
Start a producer (the console producer is generally used for testing)
/export/servers/kafka/bin/kafka-console-producer.sh --broker-list node01:9092 --topic spark_kafka
Start a consumer (the console consumer is generally used for testing)
/export/servers/kafka/bin/kafka-console-consumer.sh --zookeeper node01:2181 --topic spark_kafka --from-beginning
Start a consumer that connects to the broker addresses directly
/export/servers/kafka/bin/kafka-console-consumer.sh --bootstrap-server node01:9092,node02:9092,node03:9092 --topic spark_kafka --from-beginning
Description of the two modes of integrating Kafka
This is also a popular interview topic.
In development, we often use SparkStreaming to read data from Kafka in real time and then process it. Since Spark 1.3, KafkaUtils has provided two methods for creating a DStream:
1. Receiver method:
KafkaUtils.createStream (not used in development; a general understanding is enough, but it may come up in interviews).
A Receiver runs as a long-lived Task in an Executor, waiting for data. A single Receiver is inefficient, so you have to start several, merge their data manually (union), and only then process it, which is cumbersome.
If the machine running a Receiver goes down, data may be lost, so you need to enable the WAL (write-ahead log) to keep the data safe, which in turn lowers efficiency.
The Receiver method connects to Kafka through ZooKeeper and calls Kafka's high-level API; the offsets are stored in ZooKeeper and maintained by the Receiver.
To avoid losing data, Spark also saves a copy of the offsets in the checkpoint during consumption, so the two copies may become inconsistent.
So from every angle, the Receiver mode is unsuitable for development and has been phased out.
2. Direct method:
KafkaUtils.createDirectStream (used in development; you need to master it).
The Direct method connects directly to the Kafka partitions to fetch data. Reading from each partition directly greatly improves parallelism.
It calls Kafka's low-level API directly. Offsets are stored and maintained by Spark itself, by default in the checkpoint, which eliminates the inconsistency with ZooKeeper.
Of course, you can also maintain the offsets manually and store them in MySQL or Redis.
Therefore, development is based on the Direct mode, and by combining the characteristics of the Direct mode with manual offset management, data can be guaranteed to be processed exactly once.
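To make the idea of "Direct mode + manual offset management" more concrete, here is a minimal sketch in plain Scala with no Spark or Kafka dependency. The `Store` class is a hypothetical in-memory stand-in for a transactional database such as MySQL; the assumption is that the result and the offset are updated together in one transaction, so a replayed batch is recognized and skipped.

```scala
object ExactlyOnceSketch {
  // Hypothetical stand-in for a transactional store: the running total and
  // the next offset to read always change together, never separately.
  final class Store {
    private var total: Long = 0L
    private var nextOffset: Long = 0L

    def committedOffset: Long = synchronized(nextOffset)
    def currentTotal: Long = synchronized(total)

    // Applies a batch only if it starts exactly at the committed offset.
    // Returns false (and changes nothing) for a batch that was already applied.
    def commitBatch(fromOffset: Long, values: Seq[Long]): Boolean = synchronized {
      if (fromOffset != nextOffset) false
      else {
        total += values.sum
        nextOffset = fromOffset + values.size
        true
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new Store
    val batch = Seq(1L, 2L, 3L)
    println(store.commitBatch(0L, batch)) // first delivery is applied
    println(store.commitBatch(0L, batch)) // replay after a simulated restart is rejected
    println(store.currentTotal)           // the batch was counted only once
  }
}
```

In a real job the same check would be done by the database transaction itself; the sketch only shows why storing offsets next to the results removes duplicates.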
To sum up:
Receiver method
Multiple Receivers are needed to receive data efficiently, and there is a risk of losing data.
Enabling the write-ahead log (WAL) prevents data loss, but writing the data twice is inefficient.
ZooKeeper maintains the offsets, so data may be consumed repeatedly.
Uses the high-level API.
Direct method
Reads data directly from the Kafka partitions without using a Receiver.
Does not use the write-ahead log (WAL) mechanism.
Spark maintains the offsets itself.
Uses the low-level API.
Extension: Regarding message semantics
In development, there are two versions of the SparkStreaming and Kafka integration: 0.8 and 0.10+.
The 0.8 version has both Receiver and Direct modes (but it has more problems in production environments, and it is deprecated as of Spark 2.3).
From 0.10 onward, only the Direct mode has been retained (the Receiver mode is not suitable for production environments), and the 0.10 API has changed (it is more powerful).
In conclusion:
For study and development we use the Direct mode of the 0.10 version directly, but you should still be able to explain the difference between Receiver and Direct in an interview.
spark-streaming-kafka-0-8 (understand)
1. Receiver
KafkaUtils.createStream uses receivers to receive data through Kafka's high-level consumer API, and the offsets are maintained by the Receiver in ZooKeeper. The data received by all receivers is stored in the Spark executors, and Spark Streaming then launches jobs to process it. By default this data can be lost on failure. Enabling the WAL makes the receiver synchronously save the received data to a distributed file system such as HDFS, so that the data can be recovered after an error. Although this approach combined with the WAL can guarantee zero data loss, the WAL lowers efficiency, and it cannot guarantee that each record is processed exactly once: a record may be processed twice, because Spark and ZooKeeper may be out of sync.
(This integration method is no longer officially recommended.)
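The "processed twice" case described above can be illustrated without Kafka or Spark. The following is a simplified model, not the Receiver's real code: the `run` function and its `crashBeforeCommit` flag are hypothetical, and the assumption is that records are processed first and the offset is written to a separate store (ZooKeeper in the Receiver mode) afterwards.

```scala
object AtLeastOnceSketch {
  // Processes one batch starting at the committed offset. If the job crashes
  // after processing but before the offset is saved, the offset stays where it was.
  def run(log: Vector[String], committed: Int, batchSize: Int,
          crashBeforeCommit: Boolean): (Vector[String], Int) = {
    val batch = log.slice(committed, committed + batchSize)
    val newCommitted = if (crashBeforeCommit) committed else committed + batch.size
    (batch, newCommitted)
  }

  def main(args: Array[String]): Unit = {
    val log = Vector("a", "b", "c", "d")
    // First attempt: "a" and "b" are processed, then the job crashes before committing.
    val (firstOutput, offsetAfterCrash) = run(log, 0, 2, crashBeforeCommit = true)
    // Restart: reading resumes from the old offset, so "a" and "b" are processed again.
    val (secondOutput, offsetAfterRetry) = run(log, offsetAfterCrash, 2, crashBeforeCommit = false)
    println(firstOutput ++ secondOutput)
    println(offsetAfterRetry)
  }
}
```

No data is lost in this model, but the first two records reach the output twice, which is at-least-once rather than exactly-once behavior.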
Preparation
1. Start the ZooKeeper cluster
zkServer.sh start
2. Start the Kafka cluster
kafka-server-start.sh /export/servers/kafka/config/server.properties
3. Create the topic
kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 1 --partitions 3 --topic spark_kafka
4. Send messages to the topic through a shell command
kafka-console-producer.sh --broker-list node01:9092 --topic spark_kafka
5. Add the Kafka pom dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.2.0</version>
</dependency>
API
Fetch the topic data from Kafka through receivers. You can run several receivers to read the data in the Kafka topic in parallel; here there are 3:
val receiverDStream: immutable.IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(x => {
  val stream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
  stream
})
If the WAL is enabled (spark.streaming.receiver.writeAheadLog.enable = true), you can set the storage level explicitly (the default is StorageLevel.MEMORY_AND_DISK_SER_2).
Code demo
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.immutable
object SparkKafka {
  def main(args: Array[String]): Unit = {
    // 1. Create the StreamingContext
    // Each receiver occupies one core, so the master needs more cores than receivers
    val config: SparkConf =
      new SparkConf().setAppName("SparkStream").setMaster("local[*]")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    // The setting above enables the WAL (write-ahead log) to make the receiving side reliable
    val sc = new SparkContext(config)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("./kafka")
    //==============================================
    // 2. Prepare the configuration parameters
    val zkQuorum = "node01:2181,node02:2181,node03:2181"
    val groupId = "spark"
    val topics = Map("spark_kafka" -> 2) // 2 means the topic's partitions are consumed with 2 threads
    // RDD partitions in Spark are not the same as Kafka topic partitions:
    // more consumer threads do not increase Spark's processing parallelism
    // 3. Fetch the topic data from Kafka through receivers; several receivers
    //    can read the Kafka topic in parallel, here there are 3
    val receiverDStream: immutable.IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(_ => {
      val stream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
      stream
    })
    // 4. Use union to merge the DStreams produced by all receivers
    val allDStream: DStream[(String, String)] = ssc.union(receiverDStream)
    // 5. Extract the data: in (String, String) the first String is the message key,
    //    the second String is the message value
    val data: DStream[String] = allDStream.map(_._2)
    //==============================================
    // 6. WordCount
    val words: DStream[String] = data.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
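The WordCount steps in the demo (flatMap, map, reduceByKey) are applied to each 5-second batch on its own, so the counts are not accumulated across batches. The sketch below shows the same transformation chain on ordinary Scala collections; `countWords` and the sample lines are illustrative assumptions, with `groupBy` plus `sum` standing in for `reduceByKey`.

```scala
object BatchWordCountSketch {
  // Same shape as the DStream pipeline, applied to the lines of one batch.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))          // split every line into words
      .map(word => (word, 1))         // pair each word with a count of 1
      .groupBy(_._1)                  // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Two batches, as they might arrive in two consecutive intervals.
    val batches = Seq(
      Seq("hello spark", "hello kafka"),
      Seq("hello streaming")
    )
    // Each batch is counted independently: "hello" is 2 in the first, 1 in the second.
    batches.map(countWords).foreach(println)
  }
}
```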
2. Direct
The Direct method periodically queries Kafka for the latest offset of each partition of the topic, and then processes the data of each batch according to the resulting offset ranges.
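The per-batch offset ranges mentioned above can be sketched in plain Scala. This is a simplified model under stated assumptions: the local `OffsetRange` case class only mirrors the idea of a (topic, partition, from, until) range and is not Spark's own class, and `planBatch` with its sample offsets is hypothetical.

```scala
object DirectOffsetRangeSketch {
  // One range per Kafka partition: the batch reads offsets [fromOffset, untilOffset).
  final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // consumed: next offset to read per partition (what the job has already handled).
  // latest:   current end offset per partition, as reported by the brokers.
  // A partition with no consumed offset starts at its latest offset (an assumption of this sketch).
  def planBatch(topic: String, consumed: Map[Int, Long], latest: Map[Int, Long]): Seq[OffsetRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, until) =>
      val from = consumed.getOrElse(partition, until)
      OffsetRange(topic, partition, from, until)
    }

  def main(args: Array[String]): Unit = {
    val consumed = Map(0 -> 100L, 1 -> 40L, 2 -> 7L)
    val latest = Map(0 -> 130L, 1 -> 40L, 2 -> 10L)
    val ranges = planBatch("spark_kafka", consumed, latest)
    // Three Kafka partitions give three ranges, one per RDD partition of the batch.
    ranges.foreach(r => println(s"partition ${r.partition}: ${r.fromOffset} until ${r.untilOffset} (${r.count} records)"))
  }
}
```

After the batch succeeds, the until offsets become the consumed offsets for the next batch, which is the bookkeeping that the checkpoint (or a manual store) has to preserve.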
The disadvantage of Direct is that ZooKeeper-based Kafka monitoring tools cannot be used with it.
Direct has several advantages over Receiver:
Simplified parallelism
There is no need to create multiple Kafka input streams and union them. SparkStreaming creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel; RDD partitions in Spark and partitions in Kafka map one to one.
Efficiency
Receiver achieves zero data loss by saving the data to the WAL first, which means the data is stored twice: once by Kafka's own replication and once more when it is written to the WAL. Direct does not use the WAL, which removes this overhead.
Exactly-once semantics
Receiver reads Kafka data through Kafka's high-level API and writes the offsets to ZooKeeper. Although saving the data in the WAL ensures that it is not lost, the offsets held by SparkStreaming and by ZooKeeper may be inconsistent, so data may be consumed more than once.
Direct implements exactly-once semantics (EOS) with Kafka's low-level API: the offsets are saved only by the StreamingContext (ssc) in the checkpoint, which eliminates the inconsistency between the zk and ssc offsets.
API
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
Code demo
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object SparkKafka2 {
  def main(args: Array[String]): Unit = {
    // 1. Create the StreamingContext
    val config: SparkConf =
      new SparkConf().setAppName("SparkStream").setMaster("local[*]")
    val sc = new SparkContext(config)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("./kafka")
    //==============================================
    // 2. Prepare the configuration parameters
    val kafkaParams = Map("metadata.broker.list" -> "node01:9092,node02:9092,node03:9092", "group.id" -> "spark")
    val topics = Set("spark_kafka")
    val allDStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    // 3. Extract the data: each element is a (message key, message value) pair
    val data: DStream[String] = allDStream.map(_._2)
    //==============================================
    // WordCount
    val words: DStream[String] = data.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
spark-streaming-kafka-0-10
Explanation
In spark-streaming-kafka-0-10 the API has changed somewhat and is more flexible to work with; this is the version used in development.
pom.xml
<!--<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>${spark.version}</version>
</dependency>-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
API:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Create topic
/export/servers/kafka/bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 3 --partitions 3 --topic spark_kafka
Start producer
/export/servers/kafka/bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic spark_kafka
Code demo
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object SparkKafkaDemo {
  def main(args: Array[String]): Unit = {
    // 1. Create the StreamingContext
    // spark.master should be set as local[n], n > 1
    val conf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(5)) // 5 means the data is cut into one RDD every 5 seconds
    // Parameters for connecting to Kafka
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node01:9092,node02:9092,node03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "SparkKafkaDemo",
      // earliest: if a partition has a committed offset, consume from it; otherwise consume from the beginning
      // latest: if a partition has a committed offset, consume from it; otherwise consume only newly produced data
      // none: if every partition has a committed offset, consume from those offsets; if any partition has none, throw an exception
      // latest is configured here: resume from the committed offset if there is one, otherwise start with newly arriving data
      "auto.offset.reset" -> "latest",
      // false disables auto commit; offsets are saved by Spark in the checkpoint or maintained manually by the programmer
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("spark_kafka")
    // 2. Use KafkaUtils to connect to Kafka and fetch the data
    val recordDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](ssc,
      // Location strategy: the recommended choice, distributes the Kafka partitions evenly across the available executors
      LocationStrategies.PreferConsistent,
      // Consumer strategy: the recommended choice, subscribes to a fixed collection of topics
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
    // 3. Extract the VALUE of each record
    val lineDStream: DStream[String] = recordDStream.map(_.value()) // _ is a ConsumerRecord
    val wordDStream: DStream[String] = lineDStream.flatMap(_.split(" ")) // _ is the value that was sent, i.e. one line of data
    val wordAndOneDStream: DStream[(String, Int)] = wordDStream.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOneDStream.reduceByKey(_ + _)
    result.print()
    ssc.start() // start the computation
    ssc.awaitTermination() // block until the computation is stopped
  }
}
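The auto.offset.reset rules described in the comments of the demo can be written down as a small pure function. This is only a sketch of the decision rule for a single partition: `startingOffset` and its parameters are hypothetical names, and the exception types are stand-ins rather than the ones the Kafka client actually throws.

```scala
object OffsetResetSketch {
  // Returns the offset a consumer would start from for one partition.
  // committed: the group's committed offset, if any.
  // earliest / latest: the first and the end offset currently available in the partition.
  def startingOffset(policy: String, committed: Option[Long], earliest: Long, latest: Long): Long =
    committed.getOrElse {
      policy match {
        case "earliest" => earliest // no committed offset: start from the beginning
        case "latest"   => latest   // no committed offset: only read newly produced data
        case "none"     => throw new IllegalStateException("no committed offset for this partition")
        case other      => throw new IllegalArgumentException(s"unknown policy: $other")
      }
    }

  def main(args: Array[String]): Unit = {
    // A committed offset always wins, whatever the policy is.
    println(startingOffset("latest", Some(42L), earliest = 0L, latest = 100L))
    // Without a committed offset the policy decides.
    println(startingOffset("earliest", None, earliest = 5L, latest = 100L))
    println(startingOffset("latest", None, earliest = 5L, latest = 100L))
  }
}
```

This makes it visible that the setting matters only the first time a consumer group reads a partition, or after its committed offsets are gone.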
OK, this article has walked through the process of integrating SparkStreaming with Kafka and reviewed the basics of Kafka along the way. If it was useful to you, please give it a "watch" ~
This article was first published by the author on the CSDN blog, the original link:
https://blog.csdn.net/weixin_44318830/article/details/105612516