Two ways Spark Streaming consumes Kafka

First, add the Maven dependency to pom.xml:

<dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
     <version>2.0.2</version>
</dependency>
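
Note that the 0-10 artifact above only covers the direct API. The receiver-based example later in this post uses the older 0-8 integration, so running it would need something along these lines (the artifact name is the standard one; the version is only indicative):

<dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
     <version>2.0.2</version>
</dependency>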

Approach 1: reading with a receiver

This approach uses a Receiver to obtain data. The Receiver consumes data through Kafka's high-level consumer API.

The data the Receiver fetches from Kafka is stored in the memory of the Spark executors, and the jobs started by Spark Streaming then process that data.

However, with the default configuration this approach may lose data if the underlying receiver fails, because Kafka's high-level consumer API does not give Spark control over the offsets during consumption.

To get a highly reliable mechanism with zero data loss, enable Spark Streaming's write-ahead log (WAL). It synchronously writes the received Kafka data to a write-ahead log on a distributed file system such as HDFS, so even if the Spark task fails, the data can be recovered from the log.
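
A minimal sketch of the two settings this requires (the application name and HDFS path are placeholders, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")   // hypothetical app name
  .setMaster("local[2]")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // turn the WAL on
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs://namenode:9000/spark/checkpoint") // the WAL is written under the checkpoint directory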

Points to note

1. With this approach, the partitions of the Kafka topic being consumed have nothing to do with the partitions of the RDDs in Spark. Increasing the per-topic count passed to KafkaUtils.createStream() only increases the number of threads the Receiver uses to read Kafka partitions; it does not increase the parallelism with which Spark processes the data (the sketch after the example code below shows one way to raise it).

2. You can create multiple Kafka input DStreams, using different consumer groups and topics, to receive data in parallel through multiple receivers (a union of such streams is shown in the same sketch below).

3. If the write-ahead log is enabled on a fault-tolerant file system such as HDFS, the received data is already being copied into the log, so the storage level passed to KafkaUtils.createStream() can be StorageLevel.MEMORY_AND_DISK_SER (replication within Spark is no longer necessary).

4. The receiver-based API has been removed from KafkaUtils in the newer integration jars (it lived in the older spark-streaming-kafka-0-8 artifact, not the 0-10 artifact shown above).

Although it has been removed, it is still worth knowing how it was implemented:

package com.spark

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreamingReceiverKafka {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
    conf.setAppName("SparkStreamingReceiverKafka")
    conf.set("spark.streaming.kafka.maxRatePerPartition", "10")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true") // required for the write-ahead log
    conf.setMaster("local[2]")

    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val ssc = new StreamingContext(sc, Seconds(5)) // create the StreamingContext with 5-second batches

    ssc.checkpoint("hdfs://localhost:9000/log") // the write-ahead log goes under the HDFS checkpoint directory

    val zks = "zk1,zk2,zk3"
    val groupId = "kafka_spark_xf"
    val map: Map[String, Int] = Map("kafka_spark" -> 2) // topic kafka_spark, read with 2 threads

    // Arguments: streaming context, ZooKeeper quorum, consumer group id, topic-to-thread map;
    // the storage level must include disk so that received data can be spilled and replicated
    val dframe = KafkaUtils.createStream(ssc, zks, groupId, map, StorageLevel.MEMORY_AND_DISK_SER_2)

    dframe.foreachRDD(rdd => {
      // handled much like an ordinary RDD
      rdd.foreachPartition(partition => {
        partition.foreach(println)
      })
    })

    ssc.start()            // start the computation
    ssc.awaitTermination() // wait for the streaming job to terminate
  }
}
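
To illustrate points 1 and 2 from the notes above, here is a rough sketch, reusing ssc, zks and groupId from the example, of receiving with several receivers in parallel and then repartitioning so that processing parallelism actually increases. The receiver count and partition number are arbitrary illustration values:

// Several receiver-based streams unioned together receive data in parallel (point 2);
// repartition afterwards, because extra receiver threads alone do not add processing parallelism (point 1).
val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zks, groupId, Map("kafka_spark" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
}
val unioned = ssc.union(kafkaStreams)       // one DStream backed by all the receivers
val repartitioned = unioned.repartition(8)  // spread processing of each batch across 8 tasks
repartitioned.map(_._2).foreachRDD(rdd => rdd.foreachPartition(_.foreach(println)))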

Approach 2: reading in direct mode

This newer, receiver-free direct approach was introduced in Spark 1.3 to provide a more robust mechanism. Instead of receiving data through a Receiver, it periodically queries Kafka for the latest offset of each topic+partition and uses that to define the offset range of each batch. When the job that processes the data runs, Kafka's simple consumer API is used to fetch the data in the specified offset range.

This method has the following advantages:

1. Simplified parallel reading. To read multiple partitions you do not need to create multiple input DStreams and union them: Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.

2. Higher performance. To guarantee zero data loss with the receiver-based approach you have to turn on the WAL, which is inefficient because the data is effectively copied twice: once by Kafka's own replication and once more into the WAL. The direct approach does not rely on a Receiver and does not need the WAL; as long as the data is retained in Kafka, it can be recovered from Kafka's replicas.

3. Exactly-once semantics:

With the receiver-based approach, Kafka's high-level API saves the consumed offsets in ZooKeeper; this is the traditional way of consuming Kafka data. Combined with the WAL it can guarantee high reliability with zero data loss, but it cannot guarantee that each record is processed exactly once: records may be processed twice, because the receiver only commits offsets to ZooKeeper periodically, so ZooKeeper and Spark can get out of sync. For example, ZooKeeper may have recorded offsets 0 through 1024 as consumed while Spark has actually processed up to offset 1500; after recovery, offsets roughly 1024 through 1500 would be consumed a second time.

With the direct approach, Kafka's simple API is used and Spark Streaming itself tracks the consumed offsets and stores them in the checkpoint. Since Spark is the only place the offsets are tracked, it stays consistent with itself and can guarantee that the data is consumed exactly once. (A sketch of inspecting the per-partition offset ranges follows the example below.)

package com.stream

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StreamFromKafka {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("StreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "192.168.182.146:9092,192.168.182.147:9092,192.168.182.148:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group1"
    )

    /**
      * LocationStrategies.PreferBrokers: only when the Spark executors run on the same nodes as the
      *   Kafka brokers; partitions are scheduled preferentially on the machines hosting the brokers.
      * LocationStrategies.PreferConsistent: the common choice; distributes partitions evenly across
      *   all available executors.
      * The new Kafka consumer API pre-fetches messages into a buffer, so for performance the Spark
      * integration caches consumers on the executors (rather than recreating them for every batch)
      * and prefers to schedule partitions on hosts that already hold the appropriate consumer.
      * In most cases use PreferConsistent, as done here. Use PreferBrokers only when executors and
      * brokers are co-located; it prefers to schedule each partition on that partition's Kafka leader.
      */
    val topics = Array("test")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    val kafkaStream = stream.map(record => (record.key, record.value))
    val words = kafkaStream.map(_._2)      // keep only the message value
    val pairs = words.map(x => (x, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
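
To make the one-to-one mapping between Kafka partitions and RDD partitions visible, and to see the exact offset range each batch covers, the 0-10 integration lets each batch's RDD be cast to HasOffsetRanges. A small, purely illustrative addition that could be placed inside the example above, before ssc.start():

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Each RDD produced by the direct stream carries the offset ranges it was built from:
  // exactly one OffsetRange per Kafka partition, i.e. per RDD partition.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}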

Origin: blog.csdn.net/dudadudadd/article/details/114402955