Spark Structured Streaming HelloWorld
Preface
A Scala tutorial on Spark Structured Streaming + Kafka + HBase; this article is the overall entry point.
Main text
1. Spark version selection
Select the version that matches your own server. Documentation address:
https://spark.apache.org/docs/
This address lists every released version number; just pick the one running in your environment.
Here I use 2.4.5; at the time of writing, the latest documented release is 3.3.3.
2. Official example
After opening the docs for your version, you can find Spark's main components listed below.
In the screenshot below, Spark Streaming is clearly marked as the old API; the new API is Structured Streaming (circled in red in the picture). So I am currently using the new API, Structured Streaming.
HelloWorld code
The official simple word-count example, completed with the SparkSession setup and query start it needs to run:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._ // needed for lines.as[String]

// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
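To see what the flatMap/groupBy step computes without a socket source or a running Spark cluster, the same transformation can be sketched on a plain Scala collection (the names here are illustrative, not part of the official example):

```scala
// Plain-Scala sketch of the word-count transformation: split lines into
// words, then count occurrences per word (what groupBy("value").count() does).
val inputLines = Seq("hello spark", "hello structured streaming")
val allWords = inputLines.flatMap(_.split(" "))
val counts = allWords.groupBy(identity).map { case (w, ws) => (w, ws.size) }
println(counts.toList.sortBy(_._1)) // List((hello,2), (spark,1), (streaming,1), (structured,1))
```

In the streaming version the count is a running aggregate that is updated as new lines arrive; here it is computed once over a fixed collection.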
Batch code example
This is the official example; here is my understanding of it. streamingDF is the streaming DataFrame; foreachBatch iterates over each micro-batch; the data of one batch is exposed as batchDF. If you print batchId, you can see the batch number is an auto-incrementing counter.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Cache the batch so the writes below don't re-execute the upstream transformations
  batchDF.persist()
  // Operate on the batch's data; the exact code depends on what you are doing
  batchDF.write.format(...).save(...) // location 1
  batchDF.write.format(...).save(...) // location 2
  // When done, the cache must be released
  batchDF.unpersist()
}.start()
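Why the persist()/unpersist() pair matters: each action on an uncached DataFrame re-runs the whole upstream transformation chain. A plain-Scala analogy using a lazy view (hypothetical names, no Spark required) shows the same effect:

```scala
// A lazy view re-runs its map for every consumer, like an uncached DataFrame
// re-computing its transformations for every write.
var transformRuns = 0
val source = Seq(1, 2, 3)
val pipeline = source.view.map { x => transformRuns += 1; x * 2 }

pipeline.foreach(_ => ()) // "write" to location 1
pipeline.foreach(_ => ()) // "write" to location 2
println(transformRuns)    // 6: the transform ran twice per element

transformRuns = 0
val cached = pipeline.toList // analogous to batchDF.persist(): materialize once
cached.foreach(_ => ())      // "write" to location 1
cached.foreach(_ => ())      // "write" to location 2
println(transformRuns)       // 3: the transform ran only once per element
```

In Spark, unpersist() additionally frees the cached blocks on the executors once both writes have finished, so memory is not held across batches.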