Spark Streaming Scenario Application - Kafka Data Reading Method

Overview

Spark Streaming supports reading from a variety of real-time input sources, including Kafka, Flume, socket streams, and more. Real-time input sources other than Kafka are not involved in our business scenario and will not be discussed here. Focusing on our current business scenario, this article only covers the ways Spark Streaming reads Kafka data. Spark Streaming officially provides two ways to read Kafka data:

  • The Receiver-based Approach. This is the first read mode officially provided; support for zero-data loss was added in Spark 1.2.
  • The Direct Approach (No Receivers). This read mode was introduced in Spark 1.3.

The two reading methods are very different, and each has its own advantages and disadvantages. Let us analyze them in detail.

1. Receiver-based Approach

As mentioned above, the Receiver-based mode was the first Kafka data consumption mode officially provided by Spark. However, there is a risk of losing data when the program fails, so the configuration parameter spark.streaming.receiver.writeAheadLog.enable was introduced in Spark 1.2 to avoid this risk. Here is the official wording:

Under the default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero-data loss, you have to additionally enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure.

Receiver-based reading method

The Receiver-based reading mode consumes Kafka data through Kafka's high-level API. After the Spark Streaming job is submitted, the Spark cluster designates dedicated Receivers that read Kafka data continuously and asynchronously. The read interval and the range of offsets for each read can be configured through parameters. The data read is stored in the Receivers, with the StorageLevel specified by the user, such as MEMORY_ONLY. When the driver triggers a batch task, the data in the Receivers is transferred to the remaining Executors for processing. After processing, the Receivers update the offsets in ZooKeeper accordingly. To ensure at-least-once semantics, spark.streaming.receiver.writeAheadLog.enable can be set to true. The Receiver execution flow is shown in the following figure:

[Figure: Receiver-based execution flow]

Receiver-based read implementation

Kafka's high-level API lets users focus on the data itself without tracking or maintaining consumer offsets, which reduces workload and code and is relatively simple. Therefore, when we first introduced the Spark Streaming computing engine, we chose this way of reading data. The specific code is as follows:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

  /* Function that reads Kafka data */
  def getKafkaInputStream(zookeeper: String,
                          topic: String,
                          groupId: String,
                          numReceivers: Int,
                          partition: Int,
                          ssc: StreamingContext): DStream[String] = {
    val kafkaParams = Map(
      ("zookeeper.connect", zookeeper),
      ("auto.offset.reset", "largest"),
      ("zookeeper.connection.timeout.ms", "30000"),
      ("fetch.message.max.bytes", (1024 * 1024 * 50).toString),
      ("group.id", groupId)
    )
    // Each Receiver starts partition / numReceivers consumer threads for the topic
    val topics = Map(topic -> partition / numReceivers)

    val kafkaDstreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc,
        kafkaParams,
        topics,
        StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
    }

    // Combine the DStreams produced by all Receivers into one DStream
    ssc.union(kafkaDstreams)
  }

As shown in the above code, the function getKafkaInputStream takes zookeeper, topic, groupId, numReceivers, partition and ssc as parameters, which correspond to:

  • zookeeper: ZooKeeper connection information
  • topic: the Kafka topic to read from
  • groupId: the consumer group ID
  • numReceivers: the number of Receivers to open, used to adjust concurrency
  • partition: the number of partitions of the corresponding Kafka topic
  • ssc: the StreamingContext

The above parameters are mainly used to connect to Kafka and read Kafka data. The specific steps are as follows:

  • Configure the Kafka-related read parameters: zookeeper.connect is the zookeeper parameter passed in; auto.offset.reset is set to read from the latest data in the topic; zookeeper.connection.timeout.ms is the ZooKeeper connection timeout, guarding against network instability; fetch.message.max.bytes is the maximum size of a single read; group.id specifies the consumer group.
  • Specify the per-Receiver concurrency for the topic. Because the number of Receivers is smaller than the number of topic partitions, each Receiver starts partition / numReceivers threads to read different partitions.
  • Read the Kafka data. The numReceivers parameter specifies how many Executors are used as Receivers; opening multiple Receivers improves application throughput.
  • ssc.union combines the data read by the multiple Receivers into a single DStream (a usage sketch of the function follows this list).
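
A minimal usage sketch of this function might look like the following; the ZooKeeper addresses, topic name, consumer group, Receiver count and partition count are illustrative placeholders, not values from our actual deployment:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("receiver-based-demo") // illustrative app name
val ssc = new StreamingContext(conf, Seconds(10))            // illustrative batch interval

// All argument values below are placeholders for illustration only
val lines = getKafkaInputStream(
  "zk1:2181,zk2:2181,zk3:2181", // zookeeper
  "demo_topic",                 // topic
  "demo_group",                 // groupId
  4,                            // numReceivers
  20,                           // partition count of the topic
  ssc)

lines.foreachRDD { rdd =>
  println(s"records in this batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()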

Receiver-based read issues

We adopted the Receiver-based method to meet some of our scenario requirements, and on top of it abstracted some micro-batch and in-memory computing models. In specific application scenarios, we also made some optimizations to this method:

  • Prevent data loss. Perform checkpoint operations and configure the spark.streaming.receiver.writeAheadLog.enable parameter;
  • Improve Receiver data throughput. Use MEMORY_AND_DISK_SER to store the received data, increase the memory of a single Receiver, or increase the degree of parallelism and distribute the data across multiple Receivers (see the configuration sketch after this list).
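
The following is a minimal configuration sketch for the two optimizations above, assuming an illustrative application name, batch interval and HDFS checkpoint path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiver-based-wal-demo") // illustrative app name
  // Back up received data to the write-ahead log so it can be recovered after failures
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10)) // illustrative batch interval
// The write-ahead log is stored under the checkpoint directory (illustrative HDFS path)
ssc.checkpoint("hdfs:///user/spark/checkpoints/receiver-demo")

With the write-ahead log enabled, the received data is already persisted reliably, so a non-replicated storage level such as StorageLevel.MEMORY_AND_DISK_SER, as used in the reading function above, is sufficient.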

These measures meet our application scenarios to a certain extent, for example the micro-batch and in-memory computing models. But because of these two aspects and some other factors, various problems also arise:

  • Configuring the spark.streaming.receiver.writeAheadLog.enable parameter. Before each batch is processed, its data must be backed up to the checkpoint directory, which reduces processing efficiency and in turn increases the pressure on the Receiver side; in addition, the backup mechanism is sensitive to load, and under high load there is a risk of delays that can crash the application.
  • Using MEMORY_AND_DISK_SER reduces the memory requirement, but it slows down computation to a certain extent.
  • Single-Receiver memory. Since the Receiver is also part of an Executor, its memory is increased to improve throughput. However, each batch computation does not actually use that much memory, resulting in a serious waste of resources.
  • Improving parallelism by using multiple Receivers to ingest Kafka data. Receivers read data asynchronously and do not participate in computation, so opening a high degree of parallelism just to balance throughput is not cost-effective.
  • The Receivers and the computing Executors run asynchronously. When network or other problems delay computation, the batch queue keeps growing while the Receivers keep receiving data, which can easily crash the program.
  • On failure recovery, part of the data may already have been persisted while the program failed before the offsets were updated, which leads to repeated consumption of data.

In order to solve the above problems and reduce resource usage, we later switched to the Direct Approach for reading Kafka data, which is described in detail below.

2. Direct Approach (No Receivers)

Different from the Receiver-based data consumption method, Spark officially introduced the Direct method of Kafka data consumption in Spark 1.3. Compared with the Receiver-based method, the Direct method has the following advantages:

  • Simplified Parallelism. There is no need to create and union multiple input streams; Kafka topic partitions map one-to-one to RDD partitions. The official description is as follows:

No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

  • Efficiency. For the Receiver-based mode to guarantee zero data loss, spark.streaming.receiver.writeAheadLog.enable must be configured, which stores two copies of the data, wasting storage and hurting efficiency. The Direct mode does not have this problem.

Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.

  • Exactly-once semantics. In the Receiver-based mode, data is consumed through the high-level API and offsets are saved in ZooKeeper. Through parameter configuration, at-least-once consumption can be achieved, but in that case data may be consumed repeatedly. The Direct mode avoids this:

The first approach uses Kafka's high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).

Direct reading method

The Direct mode uses Kafka's simple consumer API to read data, without going through ZooKeeper. It no longer requires a dedicated Receiver to read data continuously; instead, when a batch task is triggered, the Executors read the data and take part in the computation alongside the other Executors. The driver decides how many offsets to read, and the offsets are maintained through checkpoints. When the next batch task is triggered, the Kafka data is again read and computed by the Executors. From this process we can see that the Direct mode does not need Receivers to read data; the data is read at computation time, so its memory requirement for consumption is low and only the memory needed for batch computation has to be considered. In addition, when tasks pile up, data does not accumulate with them. The specific reading flow is shown in the following figure:

[Figure: Direct Approach read flow]
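
Since the Direct mode tracks offsets in Spark Streaming checkpoints, a common way to survive driver restarts is StreamingContext.getOrCreate. The following is only a rough sketch; the application name, batch interval and HDFS checkpoint path are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/spark/checkpoints/direct-demo" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("direct-kafka-demo") // illustrative app name
  val ssc = new StreamingContext(conf, Seconds(10))          // illustrative batch interval
  ssc.checkpoint(checkpointDir)
  // build the direct stream and the processing logic here ...
  ssc
}

// On a restart the context, including the tracked offsets, is restored from the checkpoint;
// otherwise createContext() builds a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()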

Direct read implementation

Spark Streaming provides several overloaded methods for reading Kafka data. This article focuses on the two Scala-based methods that we use in our application scenarios. The specific code is as follows:

  • In the first createDirectStream method, ssc is the StreamingContext; kafkaParams is the same configuration as in the Receiver-based mode; what needs to be pointed out here is fromOffsets, which specifies the offsets from which to start reading data.
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag,
    R: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      messageHandler: MessageAndMetadata[K, V] => R
  ): InputDStream[R] = {
    val cleanedHandler = ssc.sc.clean(messageHandler)
    new DirectKafkaInputDStream[K, V, KD, VD, R](
      ssc, kafkaParams, fromOffsets, cleanedHandler)
  }
  • The second createDirectStream method only needs 3 parameters. kafkaParams is still the same, except that the configuration auto.offset.reset can be used to specify whether to start reading from the largest or the smallest offset; topics refers to the Kafka topics, and multiple topics can be specified. The specific code is as follows:
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]
  ): InputDStream[(K, V)] = {
    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
    val kc = new KafkaCluster(kafkaParams)
    val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }

In practical application scenarios, we use the two methods in combination, broadly in two situations:

  • Application start. When the program is first developed and launched and no Kafka data has been consumed yet, we read from the largest offset, using the second method;
  • Application restart. When the program restarts after a failure caused by resources, network or other reasons, we must make sure reading resumes from the last offsets, so the first method is used.

In broad terms, the above methods meet our needs. The specific strategies are not discussed in this article; there will be dedicated articles on them later. The code for reading Kafka data from the largest or smallest offset is as follows:

/**
  * Read Kafka data, starting from the latest offsets
  *
  * @param ssc         : StreamingContext
  * @param kafkaParams : Kafka parameters
  * @param topics      : Kafka topics
  * @return : the resulting DStream
  */
private def getDirectStream(ssc: StreamingContext,
                            kafkaParams: Map[String, String],
                            topics: Set[String]): DStream[String] = {
  val kafkaDStreams = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    kafkaParams,
    topics
  )
  kafkaDStreams.map(_._2)
}

The logic code for program failure restart is as follows:

/**
  * If offsets already exist, read data starting from those offsets
  *
  * @param ssc         : StreamingContext
  * @param kafkaParams : Kafka configuration parameters
  * @param fromOffsets : the existing offsets
  * @return : the resulting DStream
  */
private def getDirectStreamWithOffsets(ssc: StreamingContext,
                                       kafkaParams: Map[String, String],
                                       fromOffsets: Map[TopicAndPartition, Long]): DStream[String] = {
  val kfkData = try {
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => mmd.message()
    )
  } catch { // offsets are invalid, read from the latest offsets instead
    case _: Exception =>
      val topics = fromOffsets.map { case (tap, _) =>
        tap.topic
      }.toSet
      getDirectStream(ssc, kafkaParams, topics)
  }
  kfkData
}

The fromOffsets parameter in the code above is obtained from external storage and needs to be converted. The code is as follows:

val fromOffsets = offsets.map { consumerInfo =>
  TopicAndPartition(consumerInfo.topic, consumerInfo.part) -> consumerInfo.until_offset
}.toMap

This method reads Kafka data from the specified offsets. If reading fails, we assume the offsets have become invalid, catch the exception, and fall back to reading Kafka data from the largest offset.
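
Putting the two paths together, a hedged sketch of how the driver might choose between them is shown below. The ConsumerInfo case class and the loadOffsetsFromStore helper are hypothetical placeholders for whatever external storage is actually used; they are not part of our original code:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical shape of an offset record loaded from external storage
case class ConsumerInfo(topic: String, part: Int, until_offset: Long)

// Hypothetical loader; returns an empty sequence when the application starts for the first time
def loadOffsetsFromStore(groupId: String): Seq[ConsumerInfo] = Seq.empty

def buildStream(ssc: StreamingContext,
                kafkaParams: Map[String, String],
                topics: Set[String],
                groupId: String): DStream[String] = {
  val offsets = loadOffsetsFromStore(groupId)
  if (offsets.isEmpty) {
    // First launch: no stored offsets, read from the largest offset
    getDirectStream(ssc, kafkaParams, topics)
  } else {
    // Restart: resume from the stored offsets
    val fromOffsets = offsets.map { consumerInfo =>
      TopicAndPartition(consumerInfo.topic, consumerInfo.part) -> consumerInfo.until_offset
    }.toMap
    getDirectStreamWithOffsets(ssc, kafkaParams, fromOffsets)
  }
}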

Direct read issues

In practical applications, the Direct Approach method meets our needs well. Compared with the Receiver-based method, it has the following advantages:

  • Lower resource usage. Direct does not need Receivers; all the Executors it requests participate in computation, whereas Receiver-based needs dedicated Receivers to read Kafka data that take no part in computation. Therefore, with the same resource request, Direct can support larger workloads.

  • Lower memory. In the Receiver-based mode, the Receivers run asynchronously with the other Executors and receive data continuously. This is fine when the data volume is small, but with a large volume the Receiver memory must be increased even though the computing Executors do not need that much memory. Direct has no Receiver; data is read at computation time and computed directly, so its memory requirements are very low. In practice, we were able to reduce memory from the original 10 GB to about 2 to 4 GB.

  • Better robustness. The Receiver-based mode needs Receivers to read data asynchronously and continuously, so when network or storage load causes real-time tasks to pile up, the Receivers keep reading data, which can easily crash the computation. Direct does not have this concern: data is read and computed only when a batch task is triggered, so a growing queue does not cause the program to fail.

As for the other advantages, such as Simplified Parallelism, Efficiency and Exactly-once semantics, they have been covered above and will not be repeated here. Although Direct has the above advantages, it also has some shortcomings:

  • Higher development cost. Direct requires users to maintain offsets with checkpoints or third-party storage, unlike Receiver-based, where ZooKeeper maintains the offsets; this increases the user's development cost (see the sketch after this list).
  • Monitoring and visualization. In the Receiver-based mode, the consumption progress of a given topic and consumer can be monitored through ZooKeeper, while Direct does not have this convenience; achieving monitoring and visualization requires extra development work.
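
To give a sense of that extra cost, the following is a minimal sketch of the common pattern for persisting offsets yourself in the Direct mode, based on HasOffsetRanges. The saveOffsetToStore helper is a hypothetical placeholder for whatever external storage is used, and directStream stands for the value returned by createDirectStream before any transformation:

import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Hypothetical persistence helper; in practice this might write to ZooKeeper, HBase, MySQL, etc.
def saveOffsetToStore(range: OffsetRange): Unit = ()

// directStream must be the DStream returned by createDirectStream, before any
// transformation such as map; otherwise the cast to HasOffsetRanges below fails.
def trackOffsets(directStream: InputDStream[(String, String)]): Unit = {
  directStream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // ... process the batch here ...

    // Persist the ending offset of each partition after the batch succeeds
    offsetRanges.foreach { range =>
      saveOffsetToStore(range) // range.topic, range.partition, range.untilOffset
    }
  }
}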

Summary

This article introduced two ways of reading Kafka data with Spark Streaming: Receiver-based and Direct. Both have their own advantages and disadvantages, but relatively speaking, Direct suits more business scenarios and scales better. How to choose between the two depends not only on the business scenario but also on the team: in the early stage of an application, to iterate quickly, you may consider the first method; for deeper use, the second method is recommended. This article only covered the two reading methods and did not touch on reading strategies, optimization and other issues, which will be covered in detail in subsequent articles.

About the author

Xu Shengguo graduated from Dalian University of Technology with a master's degree and is a data R&D engineer at the 360 Data Center, mainly responsible for architecture and R&D work based on Spark Streaming. Email: [email protected]. If you have any questions, feel free to reach out by email.
