Exploring the Spark Streaming Backpressure Mechanism

1. How the backpressure mechanism works

The Spark Streaming backpressure mechanism is a feature introduced in Spark 1.5.0 that dynamically adjusts the data ingestion rate according to the application's processing capacity.

When the batch processing time (Batch Processing Time) is greater than the batch interval (Batch Interval, i.e. batchDuration), it means data is being processed more slowly than it is being ingested. If this persists for too long, or the source suddenly floods, data is likely to accumulate in memory, eventually causing Executor OOM or task crashes.
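The accumulation described above is simple arithmetic; the following toy helper (not Spark code, names are illustrative) shows how unfinished batches pile up whenever processing time exceeds the batch interval:

```scala
object BacklogDemo {
  // Number of batches still waiting after a given number of intervals:
  // batches submitted (one per interval) minus batches the system has
  // had time to finish at the slower processing rate.
  def pendingBatches(elapsedIntervals: Int,
                     batchIntervalMs: Long,
                     processingTimeMs: Long): Long = {
    val submitted = elapsedIntervals.toLong
    val completed = (elapsedIntervals * batchIntervalMs) / processingTimeMs
    math.max(0L, submitted - completed)
  }
}
```

With a 10-second interval and a 15-second processing time, after 6 intervals 6 batches have been submitted but only 4 finished, leaving 2 queued; the backlog grows without bound unless the ingestion rate is reduced.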

In this case, if the data source is Receiver-based (e.g. a Kafka Receiver stream), the maximum input rate can be capped by setting spark.streaming.receiver.maxRate; for Direct-based sources (such as the Kafka Direct Stream), it can be capped with spark.streaming.kafka.maxRatePerPartition. If the load has been stress-tested in advance and traffic will not exceed the expected peak, setting these parameters is generally sufficient. However, a fixed maximum is not necessarily the optimal value; ideally the optimal rate should be estimated dynamically from how each batch actually performs. From Spark 1.5.0 onward, this is what the backpressure mechanism provides. It is enabled by setting spark.streaming.backpressure.enabled to true, after which Spark Streaming automatically adjusts the input rate according to processing capacity, maintaining maximum throughput and performance even at traffic peaks. Internally, when a batch completes, the rate controller's onBatchCompleted callback collects that batch's statistics:

override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val elements = batchCompleted.batchInfo.streamIdToInputInfo

    for {
      // processing end time
      processingEnd <- batchCompleted.batchInfo.processingEndTime
      // processing time, i.e. `processingEndTime` - `processingStartTime`
      workDelay <- batchCompleted.batchInfo.processingDelay
      // time spent waiting in the scheduling queue, i.e. `processingStartTime` - `submissionTime`
      waitDelay <- batchCompleted.batchInfo.schedulingDelay
      // number of records processed in this batch
      elems <- elements.get(streamUID).map(_.numRecords)
    } computeAndPublish(processingEnd, elems, workDelay, waitDelay)
  }

As can be seen, it then calls the computeAndPublish method, shown below:

private def computeAndPublish(time: Long, elems: Long, workDelay: Long, waitDelay: Long): Unit =
    Future[Unit] {
      // estimate the new rate from the processing time, scheduling delay,
      // and the record count of the current batch
      val newRate = rateEstimator.compute(time, elems, workDelay, waitDelay)
      newRate.foreach { s =>
        // set the new rate
        rateLimit.set(s.toLong)
        // publish the new rate
        publish(getLatestRate())
      }
    }

Going one level deeper, the rateEstimator.compute method is what estimates the new rate. Its signature is:

def compute(
      time: Long,
      elements: Long,
      processingDelay: Long,
      schedulingDelay: Long): Option[Double]
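The meaning of the four inputs can be illustrated with a minimal hypothetical estimator that simply proposes the last batch's measured throughput as the new rate. Spark's real RateEstimator trait is internal to the streaming package, so this sketch only mirrors the signature shown above; the class name is made up:

```scala
// Hypothetical estimator mirroring the `compute` signature: propose the
// last batch's measured throughput (records per second) as the new rate,
// returning None when the batch carries no usable signal.
class ThroughputEstimator {
  def compute(time: Long, elements: Long,
              processingDelay: Long, schedulingDelay: Long): Option[Double] =
    if (elements > 0 && processingDelay > 0)
      Some(elements.toDouble / processingDelay * 1000) // ms -> per-second rate
    else
      None
}
```

For example, 1000 records processed in 2000 ms yields a proposed rate of 500 records per second.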

2. Backpressure-related parameters

  • spark.streaming.backpressure.enabled
    Default: false. Whether the backpressure mechanism is enabled.

  • spark.streaming.backpressure.initialRate
    Default: none. The initial maximum receiving rate. Applies only to Receiver-based streams, not to Direct streams. Integer type; if unset, the first batch reads everything available. Setting it limits how much the first batch consumes: on a cold start the queue may already hold a large backlog, and reading it all at once could clog the system.

  • spark.streaming.kafka.maxRatePerPartition
    Integer type; by default each Direct-stream consumer reads everything available. Limits the maximum number of records read per second from each Kafka partition.

  • spark.streaming.stopGracefullyOnShutdown
    Enables graceful shutdown: when the task is killed, the current batch of data is allowed to finish before the program stops, so the kill does not interrupt processing or lose unprocessed data.
Note (referring to the parameters above as 1 = backpressure.enabled, 2 = backpressure.initialRate, 3 = kafka.maxRatePerPartition):
With only 3 active, each batch consumes at most the configured amount: if fewer records are available, it reads whatever there is; if more, it reads only up to the configured value.
With 1 + 3 active, the amount consumed per batch varies with system load, between a minimum that Spark infers automatically and the maximum set by 3; the very first batch, however, reads everything available, after which the 1 + 3 rule applies.
With 1 + 2 + 3 active, behavior is essentially the same as the previous case, except that the first batch is also limited, because the initial consumption rate has been configured.
  • spark.streaming.backpressure.rateEstimator
    Default: pid. The rate estimator; Spark ships only this default PID-based estimator, but a custom one can be provided.

  • spark.streaming.backpressure.pid.proportional
    Default: 1.0; must be non-negative. The weight of the difference between the current rate and the last batch's rate in the overall control signal. The default value is usually fine.

  • spark.streaming.backpressure.pid.integral
    Default: 0.2; must be non-negative. The weight of the accumulated error in the overall control signal. The default value is usually fine.

  • spark.streaming.backpressure.pid.derived
    Default: 0.0; must be non-negative. The weight of the error's rate of change in the overall control signal. The default value is usually fine.

  • spark.streaming.backpressure.pid.minRate
    Default: 100; must be positive. The minimum rate.
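
The four pid.* weights above combine in Spark's default PID estimator roughly as follows. This is a simplified, self-contained transcription of the update rule for illustration; the class and field names here are made up, not Spark's internal API:

```scala
// Sketch of the PID rate update: P tracks the throughput error of the
// last batch, I the scheduling backlog expressed as a rate, D how fast
// the error is changing. The result is clamped below by minRate.
class PidEstimatorSketch(batchIntervalMillis: Long,
                         proportional: Double = 1.0,
                         integral: Double = 0.2,
                         derivative: Double = 0.0,
                         minRate: Double = 100.0) {
  private var firstRun = true
  private var latestTime = -1L
  private var latestRate = -1.0
  private var latestError = -1.0

  def compute(time: Long, elements: Long,
              processingDelay: Long, schedulingDelay: Long): Option[Double] =
    if (time > latestTime && elements > 0 && processingDelay > 0) {
      val delaySinceUpdate = (time - latestTime).toDouble / 1000
      // records per second the last batch actually processed
      val processingRate = elements.toDouble / processingDelay * 1000
      // P: gap between the current rate and measured throughput
      val error = latestRate - processingRate
      // I: queued records, converted to a per-interval rate
      val historicalError = schedulingDelay.toDouble * processingRate / batchIntervalMillis
      // D: change of the error per second
      val dError = (error - latestError) / delaySinceUpdate
      val newRate = (latestRate
        - proportional * error
        - integral * historicalError
        - derivative * dError).max(minRate)
      latestTime = time
      if (firstRun) { firstRun = false; latestRate = processingRate; latestError = 0.0; None }
      else { latestRate = newRate; latestError = error; Some(newRate) }
    } else None
}
```

The first completed batch only seeds the estimator's state (it returns None); from the second batch on, each call yields a new proposed rate, never below minRate.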

3. Using the backpressure mechanism

// enable the backpressure mechanism
conf.set("spark.streaming.backpressure.enabled","true")
// minimum ingestion rate
conf.set("spark.streaming.backpressure.pid.minRate","1")
// maximum records ingested per partition per second
conf.set("spark.streaming.kafka.maxRatePerPartition","12")
// initial maximum receiving rate
conf.set("spark.streaming.backpressure.initialRate","10")

To ensure the backpressure mechanism takes effect before the Spark application crashes, the maximum ingestion rate of each batch must still be capped. Taking a Direct stream such as the Kafka Direct Stream as an example, this cap is controlled by spark.streaming.kafka.maxRatePerPartition, which limits the number of records each partition may ingest per second. Suppose batchDuration is 10 seconds, spark.streaming.kafka.maxRatePerPartition is 12, and the Kafka topic has 3 partitions; then one batch reads at most 3 × 12 × 10 = 360 records. Note also that this parameter is a ceiling for the entire application lifetime: even backpressure's upward adjustments will never exceed it.
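The arithmetic in the paragraph above can be wrapped in a one-line helper (names are illustrative, not a Spark API):

```scala
object BatchCap {
  // Maximum records a single batch can ingest under
  // spark.streaming.kafka.maxRatePerPartition:
  // partitions × rate per partition (records/s) × batch duration (s).
  def maxRecordsPerBatch(partitions: Int,
                         maxRatePerPartition: Int,
                         batchDurationSec: Int): Long =
    partitions.toLong * maxRatePerPartition * batchDurationSec
}
```

With the values from the example, BatchCap.maxRecordsPerBatch(3, 12, 10) gives the 360-record ceiling.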

Origin blog.51cto.com/14309075/2414995