Practical Techniques: Spark Streaming Data Read Modes for Kafka


Outline

Spark Streaming supports reading from multiple input sources in real time, including Kafka, Flume, socket streams, and so on. Since our business scenarios do not involve real-time input sources other than Kafka, those are not discussed here. This article focuses on our current business scenario and is only concerned with how Spark Streaming reads Kafka data. Spark officially offers two ways to read Kafka data:

  • Receiver-based Approach. This is the first officially supported read mode; zero-data loss support was added in Spark 1.2;
  • Direct Approach (No Receivers). This read mode was introduced in Spark 1.3.

These two read modes work very differently and, of course, each has its advantages and disadvantages. Next, let us dissect the two data read modes in detail.

1. Receiver-based Approach

As described above, the Receiver-based mode is the first Kafka consumption pattern Spark provided officially. However, this mode can lose data when the program fails; Spark 1.2 introduced the configuration parameter spark.streaming.receiver.writeAheadLog.enable to avoid this risk. In the official words:

Under the default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero-data loss, you have to additionally enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure.

Receiver-based read mode

The Receiver-based mode consumes data from Kafka through Kafka's high-level consumer API. After a Spark Streaming job is submitted, the Spark cluster sets aside the specified number of Receivers, which read Kafka data continuously and asynchronously; the read interval and the offset range of each read can be configured with parameters. The Receivers store the data they read with a StorageLevel specified by the user, such as MEMORY_ONLY. When the driver triggers a batch job, the data held by the Receivers is transferred to the remaining Executors for processing. After processing, the Receivers update the corresponding offsets in ZooKeeper. To guarantee at-least-once reads, set spark.streaming.receiver.writeAheadLog.enable to true. The Receiver-based read process is shown below:

[Figure: Receiver-based Kafka read flow]

Receiver-based read implementation

Kafka's high-level consumer API lets users focus on the data being read, without tracking or maintaining consumer offsets, which reduces code and the user's workload and is relatively simple. Therefore, when we first introduced the Spark Streaming compute engine, we preferred this way of reading data; the specific code is as follows:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

/** Read Kafka data with the receiver-based createStream API. */
def getKafkaInputStream(zookeeper: String,
                        topic: String,
                        groupId: String,
                        numReceivers: Int,
                        partition: Int,
                        ssc: StreamingContext): DStream[String] = {
  val kafkaParams = Map(
    ("zookeeper.connect", zookeeper),
    ("auto.offset.reset", "largest"),
    ("zookeeper.connection.timeout.ms", "30000"),
    ("fetch.message.max.bytes", (1024 * 1024 * 50).toString),
    ("group.id", groupId)
  )
  // Each receiver uses partition / numReceivers consumer threads for this topic.
  val topics = Map(topic -> partition / numReceivers)

  val kafkaDstreams = (1 to numReceivers).map { _ =>
    KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc,
      kafkaParams,
      topics,
      StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
  }

  // Merge the streams produced by all receivers into a single DStream.
  ssc.union(kafkaDstreams)
}

As shown above, the function getKafkaInputStream takes zookeeper, topic, groupId, numReceivers, partition, and ssc, which correspond to the following:

  • zookeeper: the ZooKeeper connection string
  • topic: the Kafka topic to read
  • groupId: the consumer group
  • numReceivers: the number of Receivers to open, used to adjust concurrency
  • partition: the number of partitions of the Kafka topic
  • ssc: the StreamingContext

These parameters are mainly used to connect to Kafka and read its data. The specific steps are as follows (a usage sketch follows the list):

  • Build the Kafka read parameters: zookeeper.connect is the zookeeper argument passed in; auto.offset.reset makes reads start from the latest data in the topic; zookeeper.connection.timeout.ms is the ZooKeeper connection timeout, to guard against network instability; fetch.message.max.bytes is the maximum size of a single read; group.id specifies the consumer group.
  • Specify the per-topic concurrency. Because the number of Receivers is smaller than the number of topic partitions, each Receiver opens the corresponding number of threads to read different partitions.
  • Read the Kafka data. The numReceivers parameter specifies how many Executors act as Receivers; more Receivers are opened to improve application throughput.
  • union the streams read by the multiple Receivers into one stream.
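
For completeness, here is a minimal usage sketch of getKafkaInputStream wired into a streaming application. The batch interval, ZooKeeper address, topic, consumer group, and receiver/partition counts below are placeholder assumptions for illustration, not values from our production setup.

// Minimal usage sketch of the function above; all connection values are hypothetical.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("receiver-based-kafka-demo")
val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

val lines = getKafkaInputStream(
  zookeeper = "zk-host:2181",
  topic = "demo_topic",
  groupId = "demo_group",
  numReceivers = 4,
  partition = 8,
  ssc = ssc)

lines.count().print() // a simple output action so the batches actually run
ssc.start()
ssc.awaitTermination()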

Problems with the Receiver-based read

The Receiver-based mode met the needs of some of our scenarios, and we built some micro-batch, in-memory computation models on top of it. For specific scenarios we also made some optimizations to this mode (a configuration sketch follows the list):

  • Preventing data loss: perform checkpoint operations and configure the spark.streaming.receiver.writeAheadLog.enable parameter;
  • Improving Receiver throughput: use MEMORY_AND_DISK_SER to store the data, increase the memory of a single Receiver, or raise the degree of parallelism so the data is spread across multiple Receivers.
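
As an illustration of the loss-prevention settings above, a minimal configuration sketch might look like the following; the application name and checkpoint path are hypothetical.

// Enable the receiver write-ahead log and set a checkpoint directory (path is a placeholder).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiver-based-kafka-wal")
  // Back up received blocks to a write-ahead log before acknowledging them.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The write-ahead log and driver metadata are written under the checkpoint directory.
ssc.checkpoint("hdfs:///user/spark/checkpoint/receiver-demo")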

The above approach satisfied our application scenarios to some extent, such as the micro-batch and in-memory computation models. But these same factors, along with other aspects of the mode, led to various problems:

  • Configuring spark.streaming.receiver.writeAheadLog.enable means every batch must be backed up to the checkpoint directory before it is processed, which reduces processing efficiency and in turn increases the pressure on the Receiver side; in addition, because of the backup mechanism, throughput suffers, and under high load delays appear and the application risks crashing.
  • Using MEMORY_AND_DISK_SER lowers the memory requirement, but it affects computation speed to some extent.
  • Increasing a single Receiver's memory. The Receiver runs inside an Executor, so to raise throughput the Receiver's memory is increased, but the batches participating in computation do not need that much memory, which wastes resources badly.
  • Using a higher degree of parallelism, i.e. multiple Receivers, to store Kafka data. Receivers read data asynchronously and do not participate in computation, so opening a high degree of parallelism just to balance throughput is not worthwhile.
  • Receivers and the computing Executors are asynchronous. When network or similar factors delay computation, the batch queue keeps growing while the Receivers keep receiving data, which very easily crashes the program.
  • When the program fails and recovers, some data may already have been persisted, but because the program failed the offsets were not updated, which leads to duplicate consumption of data.

To solve the problems above and reduce resource usage, we later adopted the Direct Approach to read Kafka data, elaborated next.

2. Direct Approach (No Receivers)

Unlike the Receiver-based consumption mode, Spark officially introduced the Direct mode of consuming Kafka data in Spark 1.3. Relative to the Receiver-based approach, the Direct mode has advantages in the following areas:

  • Simplified Parallelism. There is no longer any need to create multiple input sources and union them; Kafka topic partitions correspond one-to-one with RDD partitions. The official description is as follows:

No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

  • Efficiency. To ensure zero-data loss, the Receiver-based mode needs spark.streaming.receiver.writeAheadLog.enable to be configured, which keeps two copies of the data, wasting storage space and also hurting efficiency. The Direct mode does not have this problem.

Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.

  • Exactly-once semantics. In the high-level API, data is consumed by Spark Streaming while the offsets are kept in ZooKeeper. With parameter configuration this achieves at-least-once consumption, but data may then be consumed repeatedly; the Direct mode instead tracks offsets in Spark Streaming's checkpoints (a checkpoint-recovery sketch follows the official description below).

The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).
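
The following sketch illustrates the checkpoint-based offset tracking described in the quote: on a restart, StreamingContext.getOrCreate rebuilds the context, and the offsets it was tracking, from the checkpoint directory. The checkpoint path, broker list, and topic are placeholder assumptions.

// Checkpoint-based recovery sketch for the Direct approach; paths and brokers are hypothetical.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///user/spark/checkpoint/direct-demo"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("direct-kafka-demo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)

  val kafkaParams = Map("metadata.broker.list" -> "broker-1:9092,broker-2:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("demo_topic"))

  // Output must still be idempotent (or transactional) for end-to-end exactly-once.
  stream.map(_._2).count().print()
  ssc
}

// On a clean start this builds a new context; after a failure it recovers from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()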

Direct read mode

The Direct mode uses Kafka's simple consumer API to read data, without going through ZooKeeper, so it no longer requires dedicated Receivers continuously reading data. When a batch job is triggered, the data is read by the Executors and participates directly in the computation on the other Executors. The driver decides how many offsets to read and hands the offsets to the checkpoints to maintain. When the next batch job is triggered, the Executors again read from Kafka and compute. From this process we can see that the Direct mode needs no Receivers to read data; it reads only the data needed for computation, so it demands less memory, just what the batch computation itself requires; moreover, when batch jobs pile up, no data accumulates alongside them. The read process is shown below:

[Figure: Direct Kafka read flow]

Direct read implementation

Spark Streaming provides several overloaded methods for reading Kafka data. This post focuses on the two Scala-based methods used in our application scenario. The method code is as follows:

  • The first createDirectStream method: ssc is the StreamingContext; kafkaParams takes the same configuration described in the Receiver-based section; note the fromOffsets parameter, which specifies the offsets from which to start reading data.
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag,
    R: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      messageHandler: MessageAndMetadata[K, V] => R
  ): InputDStream[R] = {
    val cleanedHandler = ssc.sc.clean(messageHandler)
    new DirectKafkaInputDStream[K, V, KD, VD, R](
      ssc, kafkaParams, fromOffsets, cleanedHandler)
  }
  • The second createDirectStream method needs only three parameters. kafkaParams is still the same; nothing changes except that its auto.offset.reset setting can be used to choose whether reading starts from the largest or the smallest offset. topics refers to the Kafka topics; more than one can be specified. The method code is as follows:
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]
  ): InputDStream[(K, V)] = {
    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
    val kc = new KafkaCluster(kafkaParams)
    val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }

In our practical application scenario, we combine the two methods, roughly along two directions:

  • Application start. When the program is first developed and brought online, no Kafka data has been consumed yet, so we read from the largest offset, using the second method;
  • Application restart. When the program is restarted because of resources, the network, or other failures, we must make sure reading starts from the previous offsets, so we use the first method.

Overall, the methods above meet our needs. The specific strategies are not discussed in this post; a follow-up article will introduce them. The code for reading Kafka data from the latest (largest) offset is as follows:

/**
  * Read Kafka data starting from the latest offsets.
  *
  * @param ssc         : StreamingContext
  * @param kafkaParams : Kafka parameters
  * @param topics      : Kafka topics
  * @return the stream of message values
  */
private def getDirectStream(ssc: StreamingContext,
                            kafkaParams: Map[String, String],
                            topics: Set[String]): DStream[String] = {
  val kafkaDStreams = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    kafkaParams,
    topics
  )
  kafkaDStreams.map(_._2)
}

The restart-after-failure logic is as follows:

/**
  * If offsets are already saved, start reading from those offsets.
  *
  * @param ssc         : StreamingContext
  * @param kafkaParams : Kafka configuration parameters
  * @param fromOffsets : the previously saved offsets
  * @return the stream of message values
  */
private def getDirectStreamWithOffsets(ssc: StreamingContext,
                                       kafkaParams: Map[String, String],
                                       fromOffsets: Map[TopicAndPartition, Long]): DStream[String] = {
  val kfkData = try {
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => mmd.message()
    )
  } catch { // The saved offsets are no longer valid; fall back to reading from the latest offsets.
    case _: Exception =>
      val topics = fromOffsets.map { case (tap, _) =>
        tap.topic
      }.toSet
      getDirectStream(ssc, kafkaParams, topics)
  }
  kfkData
}

The fromOffsets parameter in the code is obtained from external storage and needs to be converted; the conversion code is as follows:

val fromOffsets = offsets.map { consumerInfo =>
  TopicAndPartition(consumerInfo.topic, consumerInfo.part) -> consumerInfo.until_offset
}.toMap

This method reads Kafka data starting from the specified offsets. If reading fails, we consider the offsets to have expired, catch the exception, and read Kafka data from the largest offset instead.
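
Putting the pieces together, the following sketch shows one possible way to dispatch between the two helpers above. The OffsetRecord case class and loadOffsetsFromStore are hypothetical stand-ins for whatever external store (for example MySQL, Redis, or ZooKeeper) actually holds the saved offsets.

// Hypothetical dispatch between first-start and restart paths; the offset store is a placeholder.
import kafka.common.TopicAndPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

case class OffsetRecord(topic: String, part: Int, until_offset: Long)

// Placeholder: load previously saved offsets for this group from an external store.
def loadOffsetsFromStore(groupId: String): Seq[OffsetRecord] = Seq.empty

def getInputStream(ssc: StreamingContext,
                   kafkaParams: Map[String, String],
                   topics: Set[String],
                   groupId: String): DStream[String] = {
  val offsets = loadOffsetsFromStore(groupId)
  if (offsets.isEmpty) {
    // First start: no saved offsets, read from the latest offsets.
    getDirectStream(ssc, kafkaParams, topics)
  } else {
    // Restart: resume from the saved offsets (falls back internally if they are stale).
    val fromOffsets = offsets.map { consumerInfo =>
      TopicAndPartition(consumerInfo.topic, consumerInfo.part) -> consumerInfo.until_offset
    }.toMap
    getDirectStreamWithOffsets(ssc, kafkaParams, fromOffsets)
  }
}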

Problems with the Direct read

In practical applications, the Direct Approach meets our needs well. Compared with the Receiver-based mode, it has advantages in the following areas:

  • Lower resource usage. Direct needs no Receivers; all the Executors the application requests participate in the computation tasks, whereas the Receiver-based mode needs dedicated Receivers that read Kafka data but do not participate in computation. So with the same requested resources, Direct can support a larger workload.

  • Lower memory usage. In the Receiver-based mode, the Receivers are asynchronous with the other Executors and keep receiving data. This is fine for low-volume scenarios, but under heavy traffic the Receivers' memory has to be increased, while the Executors doing the computation do not need that much memory. Direct has no Receivers: data is read at computation time and computed directly, so the memory requirement is low. In practice we were able to bring the original 10G down to about 2-4G.

  • More robust. The Receiver-based mode needs Receivers to read data asynchronously and continuously; when network or storage load causes real-time tasks to pile up, the Receivers still keep reading data, which easily makes the computation collapse. Direct has no such worry: the Driver reads the data and computes it only when a batch task is triggered, and a growing queue does not cause the program to fail.

As for the other advantages, such as Simplified Parallelism, Efficiency, and Exactly-once semantics, they were listed earlier and are not repeated here. Although Direct has these advantages, it also has some deficiencies, as follows:

  • Higher development cost. Direct requires the user to use checkpoints or third-party storage to maintain offsets, instead of having ZooKeeper maintain them as in the Receiver-based mode, which increases the user's development cost (a sketch of external offset persistence follows this list).
  • Monitoring and visualization. In the Receiver-based mode, the consumption of a specified topic by a specified consumer group can be monitored through ZooKeeper, whereas Direct has no such convenience; if monitoring and visualization are needed, extra development effort must be invested.
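
To make the first point concrete, the following sketch shows the common pattern for persisting offsets yourself with the Direct approach: read the offset ranges Spark attaches to each batch's RDD and write them to your own store after the batch's output has succeeded. saveOffsetToStore is a hypothetical placeholder, not part of our actual code.

// Persisting offsets to an external store with the Direct approach; the store call is hypothetical.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Hypothetical placeholder: persist one partition's offset to an external store.
def saveOffsetToStore(topic: String, partition: Int, untilOffset: Long): Unit = ()

def runWithExternalOffsets(ssc: StreamingContext,
                           kafkaParams: Map[String, String],
                           topics: Set[String]): Unit = {
  // Keep a handle on the stream returned by createDirectStream itself:
  // only its RDDs carry the HasOffsetRanges information.
  val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  directStream.foreachRDD { rdd =>
    val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // ... process rdd here, writing results idempotently or transactionally ...

    // After the batch's output succeeds, record how far we have read.
    offsetRanges.foreach { r =>
      saveOffsetToStore(r.topic, r.partition, r.untilOffset)
    }
  }
}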





Origin blog.csdn.net/weixin_41663412/article/details/104860416