Spark Streaming Rate Limiting Explained

  Spark Streaming processes real-time data streams by cutting the data received from a source into slices at a fixed time interval and processing each slice in turn.
  Streaming is clearly distinct from batch processing: batch data has distinct boundaries and a known size, while a data stream has no boundary and its size is unknown in advance.
  Because of these characteristics the incoming data rate is unpredictable, and the rate at which data can be processed depends on hardware, network, and other resources. If the incoming stream is not throttled, then when a Spark node fails, the network degrades, or processing slows down while data keeps arriving, Spark Streaming can run out of memory (OOM) and crash.
  Spark Streaming applies different rate-limiting strategies to different data sources, but whether it is the Socket source's strategy or the Kafka source's, the rate (rate) itself is computed with the same PIDController algorithm.
  The sections below walk through rate limiting for the Socket data source and the Kafka data source from the source-code perspective.

Calculating and updating the rate limit

  Spark Streaming is in fact based on micro-batches (microbatch): the data stream is cut into slices by a relatively short time interval, and each slice is processed as a small batch.

  When StreamingContext.start() is called, the rate controller (rateController) is added as a StreamingListener.
  Each time a batch completes, the listener (RateController) is triggered: it takes the batch's processing end time, processing delay, scheduling delay, and the number of records received, and passes them to PIDRateEstimator, which uses the PID controller algorithm (PIDController) to compute the batch's rate (rate), update the rate limit (rateLimit), and publish the new limit.

override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
  val elements = batchCompleted.batchInfo.streamIdToInputInfo

  // Collect the completed batch's metrics and feed them to the estimator.
  for {
    processingEnd <- batchCompleted.batchInfo.processingEndTime
    workDelay <- batchCompleted.batchInfo.processingDelay
    waitDelay <- batchCompleted.batchInfo.schedulingDelay
    elems <- elements.get(streamUID).map(_.numRecords)
  } computeAndPublish(processingEnd, elems, workDelay, waitDelay)
}

private def computeAndPublish(time: Long, elems: Long, workDelay: Long, waitDelay: Long): Unit =
  Future[Unit] {
    // Ask the PID estimator for a new rate; if one is produced,
    // store it and publish it as the latest rate limit.
    val newRate = rateEstimator.compute(time, elems, workDelay, waitDelay)
    newRate.foreach { s =>
      rateLimit.set(s.toLong)
      publish(getLatestRate())
    }
  }
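
  The heart of the update is the PID computation itself. Below is a simplified, self-contained sketch of that computation, paraphrased from Spark 2.3.x's PIDRateEstimator; the first-batch bootstrap and some guards are omitted, so treat it as an illustration rather than the exact implementation:

// Simplified PID rate estimator: an illustrative paraphrase, not Spark's exact code.
class PidRateSketch(
    batchIntervalMillis: Long,
    proportional: Double = 1.0,
    integral: Double = 0.2,
    derivative: Double = 0.0,
    minRate: Double = 100.0) {

  private var latestTime = -1L
  private var latestRate = -1.0
  private var latestError = -1.0

  def compute(time: Long, elements: Long,
              processingDelay: Long, schedulingDelay: Long): Option[Double] =
    if (time > latestTime && elements > 0 && processingDelay > 0) {
      val delaySinceUpdate = (time - latestTime).toDouble / 1000 // seconds
      // How fast the last batch was actually processed (records/sec).
      val processingRate = elements.toDouble / processingDelay * 1000
      // Proportional term: gap between the allowed rate and the achieved rate.
      val error = latestRate - processingRate
      // Integral-like term: records accumulated by scheduling delay, as a rate.
      val historicalError = schedulingDelay.toDouble * processingRate / batchIntervalMillis
      // Derivative term: how fast the error is changing.
      val dError = (error - latestError) / delaySinceUpdate
      val newRate = (latestRate
        - proportional * error
        - integral * historicalError
        - derivative * dError).max(minRate)
      latestTime = time; latestRate = newRate; latestError = error
      Some(newRate)
    } else None
}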

Socket data source rate limiting

  With the batch rate limit computed as above, we now look at how the Socket data source throttles the data it receives.
  SocketInputDStream's receiver stores each received record into a buffer through BlockGenerator; before a record is written into currentBuffer, BlockGenerator calls the rate limiter (RateLimiter) to throttle the write.
  Spark's RateLimiter is a thin wrapper around the RateLimiter that ships with Google's open-source Guava library.
  Two Spark Streaming parameters control it: spark.streaming.backpressure.initialRate sets the initial rate and spark.streaming.receiver.maxRate sets the maximum rate; four further parameters can be set to tune the PIDController algorithm (see the configuration sketch below).
  Guava's RateLimiter is based on the token bucket algorithm, whose basic principle is simple: tokens are generated into a bucket at a constant rate, and generation stops when the bucket is full; a request must take a token from the bucket before it is processed, and when no token is available it blocks and waits. The algorithm protects the system from being overwhelmed by traffic peaks.
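
  For reference, here is a sketch of how these parameters might be set through SparkConf. The two receiver parameters are documented; the four spark.streaming.backpressure.pid.* keys are internal settings read by the rate estimator in Spark 2.3.x, so treat the names and default values shown here as assumptions to verify against your Spark version.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // Receiver-based sources (e.g. Socket)
  .set("spark.streaming.receiver.maxRate", "1000")          // max records/sec
  .set("spark.streaming.backpressure.initialRate", "200")   // rate before the first PID estimate
  // PID tuning (internal keys, assumed from the 2.3.x source)
  .set("spark.streaming.backpressure.pid.proportional", "1.0")
  .set("spark.streaming.backpressure.pid.integral", "0.2")
  .set("spark.streaming.backpressure.pid.derived", "0.0")
  .set("spark.streaming.backpressure.pid.minRate", "100")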

  The relevant Spark code (BlockGenerator / RateLimiter):

private lazy val rateLimiter = GuavaRateLimiter.create(getInitialRateLimit().toDouble)

/**
 * Push a single data item into the buffer.
 */
def addData(data: Any): Unit = {
  if (state == Active) {
    // Call the rate limiter and wait before buffering the record.
    waitToPush()
    synchronized {
      if (state == Active) {
        currentBuffer += data
      } else {
        throw new SparkException(
          "Cannot add data as BlockGenerator has not been started or has been stopped")
      }
    }
  } else {
    throw new SparkException(
      "Cannot add data as BlockGenerator has not been started or has been stopped")
  }
}

def waitToPush() {
  // Acquire a token from the rate limiter (blocks if none is available).
  rateLimiter.acquire()
}

Basic usage of the Guava RateLimiter:

import com.google.common.util.concurrent.RateLimiter;

// Create a rate limiter that produces 1 token per second.
RateLimiter rateLimiter = RateLimiter.create(1);
for (int i = 0; i < 10; i++) {
    // Acquire one token, blocking until one is available;
    // returns the time spent waiting, in seconds.
    double waitTime = rateLimiter.acquire();
    System.out.println(String.format("id:%d time:%d waitTime:%f",
        i, System.currentTimeMillis(), waitTime));
}
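
  To make the token-bucket principle described above concrete, here is a minimal illustrative sketch. It is not Guava's actual implementation, which uses a more refined smoothing scheme:

// Minimal token bucket: illustrative only, not production-grade.
class TokenBucket(ratePerSec: Double, capacity: Double) {
  private var tokens = capacity
  private var lastRefill = System.nanoTime()

  // Block until one token is available, then consume it.
  def acquire(): Unit = synchronized {
    refill()
    while (tokens < 1.0) {
      // Sleep roughly long enough for the next token to appear.
      // (A real limiter must not sleep while holding the lock.)
      Thread.sleep((((1.0 - tokens) / ratePerSec) * 1000).toLong.max(1L))
      refill()
    }
    tokens -= 1.0
  }

  private def refill(): Unit = {
    val now = System.nanoTime()
    val elapsedSec = (now - lastRefill) / 1e9
    // Tokens accumulate at a constant rate; the bucket stops filling when full.
    tokens = math.min(capacity, tokens + elapsedSec * ratePerSec)
    lastRefill = now
  }
}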

Kafka data source rate limiting

  In Spark Streaming's Kafka package, pulling data from Kafka proceeds roughly as follows:
  1. Fetch the latest offset of each Kafka partition.
  2. Use the rateController to limit the maximum number of messages each partition may pull.
  3. Create the KafkaRDD in DirectKafkaInputDStream, inside which the consumer objects pull the data.

  As these steps show, Kafka partition offsets (offset) bound the range of messages fetched: by limiting the offset range that a pull may cover, the number of messages pulled from Kafka is limited, which achieves the rate-limiting goal. This is how Spark Streaming's Kafka integration implements it.

The rate limit for each partition is calculated in the following steps (a sketch of the whole computation follows the list):
  1. Use seekToEnd to fetch the latest available offsets and compare them with the current offsets to obtain the lag of every partition:
  lag of a single partition = latest offset - current offset
  2. Read the per-partition maximum rate from the configuration item (spark.streaming.kafka.maxRatePerPartition), then compute the backpressure rate, where the backpressure rate of each partition is calculated as:
  backpressure rate of a single partition = partition lag / total lag of all partitions * rate limit
  rate limit (rateLimit): computed dynamically by the PIDController

  If a per-partition maximum rate is configured, the partition's rate limit is the minimum of the configured maximum rate and the backpressure rate; if it is not configured, the backpressure rate alone is used as the partition's rate limit.

  3. maximum number of messages per partition = batch interval (batchDuration) * per-partition rate limit
  4. Take the minimum of (current offset + maximum number of messages per partition) and the latest offset as the target offset, thereby controlling the pull rate.

  That is, if current offset + maximum number of messages exceeds the latest offset, the latest offset is used; otherwise current offset + maximum number of messages marks the end of the Kafka offset range to pull.
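
  As a worked example: with two partitions lagging by 300 and 100 records and a PID rateLimit of 1000 records/sec, the backpressure rates are 300/400*1000 = 750 and 100/400*1000 = 250 records/sec; with spark.streaming.kafka.maxRatePerPartition = 500, the first partition is capped at 500, so over a 2-second batch it may pull at most 1000 messages. The sketch below models this computation (modeled on DirectKafkaInputDStream.maxMessagesPerPartition; names and structure are illustrative, not Spark's exact code):

// Illustrative per-partition offset cap computation (steps 1-4 above).
def perPartitionOffsetCaps(
    currentOffsets: Map[String, Long],   // partition -> offset already consumed
    latestOffsets: Map[String, Long],    // partition -> latest available offset
    rateLimit: Double,                   // records/sec from the PID controller
    maxRatePerPartition: Long,           // spark.streaming.kafka.maxRatePerPartition (0 = unset)
    batchIntervalMs: Long): Map[String, Long] = {
  // Step 1: per-partition lag = latest offset - current offset.
  val lags = latestOffsets.map { case (tp, latest) => tp -> (latest - currentOffsets(tp)) }
  val totalLag = lags.values.sum.toDouble
  if (totalLag == 0) currentOffsets
  else currentOffsets.map { case (tp, current) =>
    // Step 2: backpressure rate = partition lag / total lag * rate limit.
    val backpressureRate = lags(tp) / totalLag * rateLimit
    // Take min(configured max, backpressure rate) when a max is configured.
    val effectiveRate =
      if (maxRatePerPartition > 0) math.min(maxRatePerPartition.toDouble, backpressureRate)
      else backpressureRate
    // Step 3: max messages = batch interval (sec) * per-partition rate.
    val maxMessages = (batchIntervalMs / 1000.0 * effectiveRate).toLong
    // Step 4: cap the target offset at the latest available offset.
    tp -> math.min(current + maxMessages, latestOffsets(tp))
  }
}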

  In Spark's DirectKafkaInputDStream, the clamp method applies this bound:

// Limit the maximum number of messages per partition.
protected def clamp(
    offsets: Map[TopicPartition, Long]): Map[TopicPartition, Long] = {

  maxMessagesPerPartition(offsets).map { mmp =>
    mmp.map { case (tp, messages) =>
      val uo = offsets(tp)
      // Target offset = min(current offset + allowed messages, latest offset).
      tp -> Math.min(currentOffsets(tp) + messages, uo)
    }
  }.getOrElse(offsets)
}

  Whether the source is Kafka or Socket, Spark Streaming uses the PIDController algorithm to calculate the rate limit value; the two differ only in how the limit is enforced, which is determined by how each source acquires its data: the Socket source relies on Guava's RateLimiter, while the Kafka source implements its own offset-based limiting.
  The versions discussed above are Spark Streaming 2.3.2 with spark-streaming-kafka-0-10_2.11.


This article was first published at Solinx:
https://mp.weixin.qq.com/s/yHStZgTAGBPoOMpj4e27Jg
