Spark Structured Streaming HelloWorld

Preface
Main text

Preface

This is the entry point for a Scala tutorial series on Spark Structured Streaming + Kafka + HBase.

Main text

1. Choosing a Spark version

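Pick the version that matches the Spark installed on your own server. The documentation is at:
https://spark.apache.org/docs/
That page is simply a list of version numbers; open the one that matches the Spark in your environment.
I am using 2.4.5 here; at the time of writing, the latest version was 3.3.3.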
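To keep a project aligned with the server, the Spark dependency can be pinned to the same version (in spark-shell, `spark.version` prints what the cluster runs). Below is a minimal build.sbt sketch, assuming sbt is the build tool and the cluster runs Spark 2.4.5; the project name, the Scala version, and the `provided` scope are illustrative assumptions, not something taken from the official docs.

```scala
// build.sbt (hypothetical project; adjust every version to match your cluster)
name := "structured-streaming-helloworld"

// Must match the Scala version your Spark build uses
// (Spark 2.4.x is available for Scala 2.11 and 2.12).
scalaVersion := "2.11.12"

// Keep this equal to the Spark version installed on the server.
val sparkVersion = "2.4.5"

libraryDependencies ++= Seq(
  // Structured Streaming is part of the spark-sql module.
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Kafka source/sink for Structured Streaming, used later in this series.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)
```

With the `provided` scope the jar is meant to be launched through spark-submit; drop `provided` if you want to run the program directly from sbt or an IDE.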

2. Official examples

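After opening the documentation for your version, you can find Spark's main features further down the page, as shown in the figure below.
Spark Streaming is explicitly marked as the old API, and the new API is Structured Streaming (circled in red in the figure), so the new API, Structured Streaming, is what I use here.
[Figure: the list of Spark's main features on the documentation page, with Structured Streaming circled in red]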

HelloWorld code

A simple word count example from the official documentation:

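// Assumes an existing SparkSession named `spark`; the import supplies the encoders that .as[String] needs
import spark.implicits._

// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream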
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()
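The snippet above only defines the query; nothing runs until a sink is started. Here is a sketch of a complete program following the same official example, with the session setup and a console sink added. The `local[2]` master is an assumption for a local test run, and the `splitWords` helper is a name introduced here only to make the splitting rule explicit.

```scala
import org.apache.spark.sql.SparkSession

object StructuredNetworkWordCount {
  // Hypothetical helper: the same split-on-space rule as the official snippet.
  def splitWords(line: String): Seq[String] = line.split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredNetworkWordCount")
      .master("local[2]") // assumption: local test run; remove when using spark-submit
      .getOrCreate()

    // Brings in the encoders that .as[String] and flatMap rely on.
    import spark.implicits._

    // Streaming DataFrame with one string column named "value", one row per input line.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val words = lines.as[String].flatMap(line => splitWords(line))
    val wordCounts = words.groupBy("value").count()

    // Print the complete, updated counts table to the console after every trigger.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    // Block until the query is stopped or fails.
    query.awaitTermination()
  }
}
```

To try it, start a text server in another terminal with `nc -lk 9999` before launching the program, then type a few lines; each line typed there should show up in the counts printed to the console.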

Batch processing code example

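This is the official example; here is my understanding of it. streamingDF is the streaming DataFrame (the whole stream, not a single batch); foreachBatch runs the given function once for every micro-batch; the data of that batch is in batchDF; and if you print batchId you will see that it is an auto-incrementing number.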

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Cache the batch so the writes below do not re-run the upstream transformations
  batchDF.persist()
  // Operate on the data of this batch; the exact code depends on the operation
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  // The cache must be released when done
  batchDF.unpersist()
}
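To see the batch numbers for yourself, the pattern above can be filled in with concrete choices. The following is only a sketch under stated assumptions: the built-in `rate` test source stands in for a real stream, Parquet files in a temporary directory stand in for the two elided locations, and `ForeachBatchSketch` and `writeTwice` are names made up for this example. The batch function is given an explicit Scala function type because an inline lambda reportedly can hit an overload ambiguity on some Scala 2.12 builds.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchSketch {
  // Hypothetical helper: builds the per-batch function for two output directories.
  // The explicit (DataFrame, Long) => Unit type selects the Scala overload of foreachBatch.
  def writeTwice(path1: String, path2: String): (DataFrame, Long) => Unit =
    (batchDF: DataFrame, batchId: Long) => {
      batchDF.persist() // cache so both writes reuse the same batch data
      println(s"batchId=$batchId rows=${batchDF.count()}")
      batchDF.write.mode("append").parquet(path1) // location 1 (assumed Parquet)
      batchDF.write.mode("append").parquet(path2) // location 2 (assumed Parquet)
      batchDF.unpersist() // release the cache
      () // make the Unit result explicit
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ForeachBatchSketch")
      .master("local[2]") // assumption: local test run
      .getOrCreate()

    // Built-in test source that emits (timestamp, value) rows at a fixed rate.
    val streamingDF = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5)
      .load()

    val out = Files.createTempDirectory("foreach-batch-").toString
    val query = streamingDF.writeStream
      .foreachBatch(writeTwice(s"$out/location1", s"$out/location2"))
      .start()

    query.awaitTermination(10000) // let the stream run for about 10 seconds
    query.stop()
    spark.stop()
  }
}
```

While it runs, the console shows one `batchId=...` line per micro-batch, and the printed ids go up by one each time, which matches the auto-incrementing behaviour described above.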
