Spark Streaming + Kafka + ES Usage Notes

emm

Small notes from a non-professional developer.

Kafka

  1. When individual messages are particularly large, Kafka throws errors; in that case set fetch.message.max.bytes to a relatively large value.

val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers, "fetch.message.max.bytes" -> "10485760")

  2. About partitions
    A KafkaDirectStream has the same number of partitions as the Kafka topic it reads from.
    If the processing workload far exceeds what that level of parallelism can consume, you can repartition the stream, or alternatively increase the number of jobs running simultaneously (a combined sketch follows this list).
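
A minimal combined sketch of the two points above, assuming the spark-streaming-kafka-0-8 direct API (which the parameter names above come from), an existing StreamingContext named ssc, a broker list string named brokers, and a placeholder topic name:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Larger fetch size so oversized messages do not fail the fetch.
val kafkaParams = Map[String, String](
  "metadata.broker.list"    -> brokers,
  "fetch.message.max.bytes" -> "10485760"
)

// The resulting stream has one partition per Kafka partition of "my-topic".
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))

// If processing needs more parallelism than the topic provides,
// repartition (at the cost of a shuffle).
val widened = stream.repartition(32)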

Spark Streaming

  1. Increase the number of concurrently running jobs

By default Spark Streaming starts only one job at a time, so if many cores are allocated but there are not enough tasks, the cores cannot be fully utilized.

To increase the number of concurrently running jobs, set the spark.streaming.concurrentJobs parameter:

spark-submit --conf spark.streaming.concurrentJobs=8 ....

In fact a job is divided into multiple tasks, each CPU core executes one task, and the core is released once the task finishes. In other words, a streaming job with 8 partitions running on 32 cores is not limited to executing only 4 jobs; check the executor core usage on the Spark Web UI page and increase or decrease concurrentJobs accordingly.
In my case utilization was relatively low and the streaming job was keeping up, so the number of cores could be reduced a bit.
  2. GC optimization
Use the CMS garbage collector:

spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" 

Since switching to this collector, GC times have dropped and memory no longer spikes as easily; as the saying goes, "climbing five floors in one breath without getting winded" ~ I don't know the underlying principles yet, I'll fill that in later.

  3. About caching
    Wherever data is reused, remember to cache it; otherwise the whole process will be re-executed from the beginning.
    The cached type must be serializable.

  4. Serialization and deserialization
    The driver serializes the contents of a task and ships it to the executors, so every type referenced by a task must be serializable.
    If a type is not serializable, an "Object not serializable" error is reported; in that case implement the serialization and deserialization methods. Usually only the deserialization method (readObject) is needed.
    In Scala, make the class extend Serializable (optionally adding an @SerialVersionUID annotation); in Java, implement the Serializable interface.

private def readObject(in: ObjectInputStream): Unit = {
  // Call the default readObject first
  in.defaultReadObject()
  // Re-initialize the fields that could not be serialized
  this.init(this.config_map)
}

For fields that cannot be serialized (MySQL connections, Redis connections, and so on), add the @transient modifier in front of the field so it is ignored during serialization, and then reconstruct it inside readObject.
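
A minimal sketch of that pattern, using a JDBC connection as the non-serializable field (the class name, fields and connect() helper are illustrative, not from the original post):

import java.io.ObjectInputStream
import java.sql.{Connection, DriverManager}

class MysqlWriter(val jdbcUrl: String, val user: String, val password: String) extends Serializable {

  // Skipped during serialization; rebuilt on the executor after deserialization.
  @transient private var conn: Connection = connect()

  private def connect(): Connection =
    DriverManager.getConnection(jdbcUrl, user, password)

  private def readObject(in: ObjectInputStream): Unit = {
    // Restore the serializable fields (jdbcUrl, user, password) first.
    in.defaultReadObject()
    // Then re-create the connection that could not be serialized.
    conn = connect()
  }
}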

Scala also has the lazy modifier, which defers construction until first use, so a @transient lazy val will be rebuilt automatically when it is first accessed.

 @transient lazy val logger:Logger = LogManager.getLogger(this.getClass.getName)

The following example, taken from a reference, is a class that wraps a Kafka producer:

import java.util.concurrent.Future

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

class KafkaSink[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
  /* This is the key idea that allows us to work around running into
     NotSerializableExceptions. */
  lazy val producer = createProducer()
  def send(topic: String, key: K, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, key, value))
  def send(topic: String, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, value))
}

object KafkaSink {
  import scala.collection.JavaConversions._
  def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
    val createProducerFunc = () => {
      val producer = new KafkaProducer[K, V](config)
      sys.addShutdownHook {
        // Ensure that, on executor JVM shutdown, the Kafka producer sends
        // any buffered messages to Kafka before shutting down.
        producer.close()
      }
      producer
    }
    new KafkaSink(createProducerFunc)
  }
  def apply[K, V](config: java.util.Properties): KafkaSink[K, V] = apply(config.toMap)
}
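
A minimal usage sketch on top of this class (the DStream[String] named stream, the brokers string and the output topic are placeholder assumptions): broadcast a single KafkaSink so each executor lazily creates one producer, instead of serializing a producer from the driver for every task.

val producerConfig = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "key.serializer"    -> "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer"  -> "org.apache.kafka.common.serialization.StringSerializer"
)
val kafkaSink = ssc.sparkContext.broadcast(KafkaSink[String, String](producerConfig))

stream.foreachRDD { rdd =>
  rdd.foreach { message =>
    // The producer is created lazily on the executor the first time send() is called.
    kafkaSink.value.send("out-topic", message)
  }
}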

  5. Broadcast variables

Variables referenced by a task are serialized and transmitted every time the task is shipped; if you want to verify the statements above, you can override readObject to print or log some debugging information.

Content that stays constant for some time and is relatively large or complex can be put into a broadcast variable, which guarantees that each executor holds only one copy of it.

For example, Redis connections, MySQL connections, and configurable rules can all be made broadcast variables.

The class wrapped by a broadcast variable must also be serializable.

Broadcast variables are read-only.

For detailed usage of broadcast variables, see the following article:
https://www.jianshu.com/p/3bd18acd2f7f
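
A minimal sketch of the basic usage (the rule map, its key and the DStream[String] named stream are illustrative):

// Built once on the driver; each executor receives and keeps a single copy.
val rules: Map[String, String] = Map("pattern" -> "ERROR.*")
val rulesBC = ssc.sparkContext.broadcast(rules)

stream.foreachRDD { rdd =>
  // Read-only access on the executors.
  val matched = rdd.filter(line => line.matches(rulesBC.value("pattern"))).count()
  println(s"matched lines: $matched")
}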

  6. Using broadcast variables for configuration updates:

Details can be found in this article:
https://www.cnblogs.com/liuliliuli2017/p/6782687.html
A wrapper class written by another developer:

import java.io.{ObjectInputStream, ObjectOutputStream}

import scala.reflect.ClassTag

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext

// This wrapper lets us update broadcast variables within DStreams' foreachRDD
// without running into serialization issues
case class BroadcastWrapper[T: ClassTag](
                                          @transient private val ssc: StreamingContext,
                                          @transient private val _v: T) {

  @transient private var v = ssc.sparkContext.broadcast(_v)

  def update(newValue: T, blocking: Boolean = false): Unit = {
    // Whether to block until the old broadcast's cached copies are removed
    v.unpersist(blocking)
    v = ssc.sparkContext.broadcast(newValue)
  }

  def value: T = v.value

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeObject(v)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    v = in.readObject().asInstanceOf[Broadcast[T]]
  }
}
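
A minimal usage sketch (loadRules(), needsRefresh(), handle() and the stream are hypothetical placeholders): re-broadcast on the driver inside foreachRDD, then read the current value on the executors.

val rulesWrapper = BroadcastWrapper(ssc, loadRules())

stream.foreachRDD { rdd =>
  // Runs on the driver for every batch: refresh and re-broadcast when needed.
  if (needsRefresh()) {
    rulesWrapper.update(loadRules(), blocking = true)
  }
  // Runs on the executors: use the current broadcast value.
  rdd.foreach(record => handle(record, rulesWrapper.value))
}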

ElasticSearch

  1. Index names can only contain lowercase letters; a hard lesson learned = =
  2. If only a single ES cluster is used, the global es.nodes configuration parameter is enough; if there are several ES sources, write a separate Map for each configuration:
val esin_setting = Map[String, String](
  "es.nodes" -> "es1",
  "es.port"  -> "7001"
)
val esout_setting = Map[String, String](
  "es.nodes"       -> "es2",
  "es.port"        -> "7001",
  "es.scroll.size" -> "5000"
)
val rdd = sc.esRDD("indexin", query, esin_setting)
rdd.saveToEs("esout/type1", esout_setting)

  3. For more complex configuration options, refer to the official Elasticsearch-Hadoop configuration documentation:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
