Spark Structured Streaming + Kafka: input and output data volumes don't match

Preface

I was recently writing a Structured Streaming program and noticed that the Kafka input and output volumes did not match. Normally, if there is only one consumer in the group, input and output should be equal.

But the volumes on my producer and consumer sides were different, as shown below:

[Screenshot: Kafka monitoring showing producer vs. consumer throughput for test19 and test20]

For my two topics, test19 and test20, the output of the two programs is roughly 10 times the input. Something is clearly wrong, so I ran a few experiments to pin down the cause and try to fix it.

Experiment 1:

Implementation: read input from Kafka and write it directly to another topic.

Experiment 1 code:
The data production and consumption code is not shown here (a rough sketch follows below).
The basic pipeline is:
java producer -> kafka_topic34 -> spark consumer -> kafka_topic35 -> java consumer
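The producer and consumer code is omitted in the original. For context, here is a minimal sketch of what such a test producer might look like, written in Scala for consistency with the rest of the post; the message count, keys, and JSON payload fields id/message are assumptions, not the author's actual code:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical test producer: sends numbered JSON messages to the input topic.
object TestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "cdpcluster-1.futuremove.cn:9092,cdpcluster-2.futuremove.cn:9092,cdpcluster-3.futuremove.cn:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    for (i <- 1 to 1000) {
      val value = s"""{"id":"$i","message":"test message $i"}"""
      producer.send(new ProducerRecord[String, String]("test34", i.toString, value))
    }
    producer.flush()
    producer.close()
  }
}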

Spark code (Scala):

import org.apache.spark.sql.SparkSession

object KafkaOutputDataSizeTest {

  def main(args: Array[String]): Unit = {

    // work around the missing HADOOP_HOME environment on Windows
    System.setProperty("hadoop.home.dir", "C://hadoop-3.1.1")

    // initialize Spark
    val spark = SparkSession
      .builder
      .appName("KafkaOutputDataSizeTest")
      .master("local[*]")
      .getOrCreate()

    // read the input topic as a streaming DataFrame
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cdpcluster-1.futuremove.cn:9092,cdpcluster-2.futuremove.cn:9092,cdpcluster-3.futuremove.cn:9092")
      .option("subscribe", "test34")
      .load()
    // val df2 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")  // optional cast; the Kafka sink also accepts the raw binary key/value columns

    // write straight through to the downstream topic (test35, per the pipeline above)
    df.writeStream
      .format("kafka")
      .option("checkpointLocation", "test34_point")
      .option("kafka.bootstrap.servers", "cdpcluster-1.futuremove.cn:9092,cdpcluster-2.futuremove.cn:9092,cdpcluster-3.futuremove.cn:9092")
      .option("topic", "test35")
      .start().awaitTermination()
  }
}


Result:

The input and output magnitudes are the same.
[Screenshot: input and output throughput match]
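The screenshot only shows the monitoring view. One way to double-check that the two topics hold the same number of records (not from the original post; broker and topic names are assumptions) is to compare their end offsets:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Hypothetical offset check: sums the end offsets across a topic's partitions.
object TopicSizeCheck {
  def totalEndOffset(bootstrap: String, topic: String): Long = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrap)
    props.put("group.id", "offset-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    try {
      val partitions = consumer.partitionsFor(topic).asScala.map(p => new TopicPartition(topic, p.partition()))
      consumer.endOffsets(partitions.asJava).asScala.values.map(_.toLong).sum
    } finally consumer.close()
  }

  def main(args: Array[String]): Unit = {
    val bootstrap = "cdpcluster-1.futuremove.cn:9092"
    println(s"test34 end offsets: ${totalEndOffset(bootstrap, "test34")}")
    println(s"test35 end offsets: ${totalEndOffset(bootstrap, "test35")}")
  }
}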

Experiment 2:

Implementation: apply a simple filter that splits the data into two sets by key (even vs. odd) and write each set to Kafka:

    // (the imports and the KafkaTestJsonBean definition this snippet needs are shown right after it)
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cdpcluster-1:9092,cdpcluster-2:9092,cdpcluster-3:9092")
      .option("subscribe", "test36")
      .load()
    // parse each Kafka value as a JSON string into a typed bean
    val df2 = df.selectExpr("CAST(value AS STRING)").as[String]
    val df3 = df2.map(x => {
      val obj = JSON.parseObject(x)
      KafkaTestJsonBean(obj.getString("id"), obj.getString("message"))
    })

    df3.writeStream.foreachBatch { (batchDF: Dataset[KafkaTestJsonBean], batchId: Long) =>
      // split each micro-batch by key parity and write each half to Kafka
      val df_1 = batchDF.filter(x => x.key.toInt % 2 == 0)
      val df_2 = batchDF.filter(x => x.key.toInt % 2 == 1)
      df_1.write
        .format("kafka")
        .option("checkpointLocation", "test37_point")
        .option("kafka.bootstrap.servers", "cdpcluster-1:9092,cdpcluster-2:9092,cdpcluster-3:9092")
        .option("topic", "test37").save()
      df_2.write
        .format("kafka")
        .option("checkpointLocation", "test37_point")
        .option("kafka.bootstrap.servers", "cdpcluster-1:9092,cdpcluster-2:9092,cdpcluster-3:9092")
        .option("topic", "test37").save()
    }.start().awaitTermination()
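The snippet above omits its imports and the bean definition. For it to compile, something like the following is needed. The field names are assumptions: the parity filter reads x.key, and the Kafka sink expects a string or binary column named value, so id presumably maps to key and message to value:

import com.alibaba.fastjson.JSON                     // assuming fastjson, given JSON.parseObject(...)
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed bean: "key" feeds the parity filter and becomes the Kafka message key,
// "value" is the column the Kafka sink writes as the message body.
case class KafkaTestJsonBean(key: String, value: String)

// Inside main, after building the SparkSession:
//   import spark.implicits._   // required for .as[String] and the encoders used by map/filter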

Result:

With the input unchanged, the output was twice the input; adding one more filter made it three times.

Cause:

This comes down to Spark's transformation and action model. Transformations are lazy; every action re-executes the whole lineage that feeds it. Writing the first filtered set is one action, which reads the micro-batch from Kafka once; writing the second filtered set is another action, which reads it again. Without caching, the source is effectively read once per action, so the total volume multiplies with the number of writes.
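A tiny standalone illustration of this behaviour (not from the original post): nothing is cached, so each action re-runs the whole lineage, including reading the source.

import org.apache.spark.sql.SparkSession

// Minimal sketch: two actions on an uncached dataset each trigger a full recomputation.
object LazyRecomputeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LazyRecomputeDemo").master("local[*]").getOrCreate()

    // the println marks every time an element is actually computed
    val rdd = spark.sparkContext.parallelize(1 to 5).map { x => println(s"computing $x"); x }

    rdd.filter(_ % 2 == 0).count()  // action 1: prints "computing" five times
    rdd.filter(_ % 2 == 1).count()  // action 2: the source is recomputed, printing five more times

    spark.stop()
  }
}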

Final conclusion:

See the description of transformations and actions in the Spark documentation. The official docs also give the solution: cache the batch with cache()/persist().
The code below comes from the official Spark 2.4.5 documentation.
After adding the batchDF.persist() line, the input and output finally match.
But don't forget to release the cache with batchDF.unpersist(). In one test where unpersist() was never called, memory eventually overflowed. In short, follow the official pattern.

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()
}

Official website address:
http://spark.apache.org/docs/2.4.5/structured-streaming-programming-guide.html
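For reference, here is roughly what experiment 2's foreachBatch looks like with that fix applied (a sketch reusing the names from above, not the author's exact code):

    df3.writeStream.foreachBatch { (batchDF: Dataset[KafkaTestJsonBean], batchId: Long) =>
      batchDF.persist()  // cache the micro-batch so both writes below reuse it instead of re-reading Kafka
      batchDF.filter(_.key.toInt % 2 == 0).write
        .format("kafka")
        .option("kafka.bootstrap.servers", "cdpcluster-1:9092,cdpcluster-2:9092,cdpcluster-3:9092")
        .option("topic", "test37")
        .save()
      batchDF.filter(_.key.toInt % 2 == 1).write
        .format("kafka")
        .option("kafka.bootstrap.servers", "cdpcluster-1:9092,cdpcluster-2:9092,cdpcluster-3:9092")
        .option("topic", "test37")
        .save()
      batchDF.unpersist()  // release the cache so memory does not build up across batches
    }.start().awaitTermination()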

Origin: blog.csdn.net/lwb314/article/details/115399210