Spark Structured Streaming HelloWorld

Preface

A Scala tutorial series on Spark Structured Streaming + Kafka + HBase; this post is the overall entry point.

Main text

1. Spark version selection

Select the version that matches your own server. Documentation address:
https://spark.apache.org/docs/
This page always lists the released version numbers; just pick the one matching the Spark in your environment. Here I use 2.4.5; as of this writing, the most recently documented release is 3.3.3.
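
If you are not sure which version your environment is running, spark-shell reports it (output shown for the 2.4.5 used here; yours will differ):

scala> spark.version
res0: String = 2.4.5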

2. Official example

After opening the documentation for your version, you can see Spark's main components listed. In the screenshot from the original post, Spark Streaming is clearly marked as the old API and Structured Streaming as the new one (circled in red), so Structured Streaming is what I use here.
[Screenshot: Spark documentation page, Structured Streaming highlighted as the new API]

HelloWorld code

A simple word-count example from the official documentation (with the SparkSession setup it assumes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._

// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()
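
Nothing actually runs until the query is started with an output sink; the official example prints the running counts to the console:

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

You can feed it input by running nc -lk 9999 in another terminal and typing lines; each trigger prints the updated counts.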

Batch code example

This is the official example; here is my understanding of it. streamingDF is the streaming DataFrame; foreachBatch calls the supplied function once per micro-batch, and the data of each batch arrives as batchDF. If you print batchId, you can see that the batch number is a monotonically increasing counter.

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Cache the batch so the writes below don't re-run the upstream transformations
  batchDF.persist()
  // Operate on the data of this batch; the exact code depends on what you need to do
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  // When finished, you must release the cache
  batchDF.unpersist()
}
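
To make the batchId behavior concrete, here is a minimal runnable sketch of my own (not from the original post): it uses the built-in rate source in place of a real Kafka input, and the object name, master setting, and rowsPerSecond value are illustrative only:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ForeachBatchDemo")
      .master("local[2]")  // local run, for illustration only
      .getOrCreate()

    // The built-in rate source continuously generates (timestamp, value) rows
    val streamingDF = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = streamingDF.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.persist()  // cache so count() doesn't recompute the batch
        // batchId starts at 0 and increases by one per micro-batch
        println(s"batchId = $batchId, rows = ${batchDF.count()}")
        batchDF.unpersist()
      }
      .start()

    query.awaitTermination()
  }
}

Running it locally prints batchId = 0, 1, 2, ... as micro-batches fire, confirming the self-incrementing batch number described above.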


Origin: blog.csdn.net/lwb314/article/details/125974541