Structured Streaming built-in data sources and implementing a custom data source

Copyright: https://shirukai.github.io/ | https://blog.csdn.net/shirukai/article/details/86687672


Release Notes:

Spark: 2.3 / 2.4

Code repository: https://github.com/shirukai/spark-structured-datasource.git

1 Structured Streaming built-in input sources (Source)

Official documentation: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources

File Source
  Options:
    maxFilesPerTrigger: maximum number of new files to consider per trigger (default: no maximum)
    latestFirst: whether to process the newest files first; useful when there is a large backlog of files (default: false)
    fileNameOnly: whether to check for new files based only on the file name instead of the full path (default: false)
  Fault tolerance: supported
  Notes: supports glob paths, but not multiple comma-separated paths

Socket Source
  Options:
    host: host to connect to, must be specified
    port: port to connect to, must be specified
  Fault tolerance: not supported

Rate Source
  Options:
    rowsPerSecond (e.g. 100, default: 1): how many rows should be generated per second
    rampUpTime (e.g. 5s, default: 0s): how long to ramp up before the generation rate reaches rowsPerSecond; granularity finer than seconds is truncated to whole seconds
    numPartitions (e.g. 10, default: Spark's default parallelism): number of partitions for the generated rows
  Fault tolerance: supported

Kafka Source
  Options: see http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
  Fault tolerance: supported

1.1 File Source

Reads files written to a directory as a data stream. Supported file formats: text, csv, json, orc, parquet.

example

Code Location: org.apache.spark.sql.structured.datasource.example

val source = spark
  .readStream
  // Schema must be specified when creating a streaming source DataFrame.
  .schema(schema)
  // Maximum number of files to consider per trigger (default: no maximum)
  .option("maxFilesPerTrigger", 100)
  // Whether to process the newest files first (default: false)
  .option("latestFirst", value = true)
  // Whether to check only the file name, so a file with the same name is not treated as new (default: false)
  .option("fileNameOnly", value = true)
  .csv("*.csv")

1.2 Socket Source

Reads UTF8 text data from a socket. It is generally used for testing: use the nc tool (e.g. nc -lk <port>) to send data to the port the socket source is listening on.

example

Code Location: org.apache.spark.sql.structured.datasource.example

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9090)
  .load()

1.3 Rate Source

Generates data at the specified number of rows per second. Each output row contains a timestamp and a value, where timestamp is a Timestamp column holding the time the message was dispatched, and value is a Long column holding the message count, starting from 0 for the first row. This source is intended for testing and benchmarking.

example

Code Location: org.apache.spark.sql.structured.datasource.example

    val rate = spark.readStream
      .format("rate")
      // Number of rows generated per second (default: 1)
      .option("rowsPerSecond", 10)
      // Number of partitions for the generated rows (default: Spark's default parallelism)
      .option("numPartitions", 10)
      // Ramp-up time in seconds before the generation rate reaches rowsPerSecond (default: 0s)
      .option("rampUpTime", 5)
      .load()

1.4 Kafka Source

Official documentation: http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

example

Code Location: org.apache.spark.sql.structured.datasource.example

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
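
The key and value columns returned by the Kafka source are binary. A common next step, shown here as a small sketch rather than as part of the original example, is to cast them to strings before further processing:

// Sketch: deserialize the Kafka key/value columns from binary to string.
val messages = df.selectExpr(
  "CAST(key AS STRING)",
  "CAST(value AS STRING)",
  "topic",
  "partition",
  "offset")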

2 Structured Streaming built-in output sinks (Sink)

File Sink
  Supported output modes: Append
  Options: path: output path (must be specified)
  Fault tolerance: supported (exactly-once)
  Notes: supports partitioned writes

Kafka Sink
  Supported output modes: Append, Update, Complete
  Options: see the Kafka Integration Guide
  Fault tolerance: supported (at-least-once)

Foreach Sink
  Supported output modes: Append, Update, Complete
  Options: none
  Notes: see the Foreach guide

ForeachBatch Sink
  Supported output modes: Append, Update, Complete
  Options: none
  Notes: see the ForeachBatch guide

Console Sink
  Supported output modes: Append, Update, Complete
  Options: numRows: number of rows to print per trigger (default: 20); truncate: whether to truncate output that is too long (default: true)

Memory Sink
  Supported output modes: Append, Complete
  Options: none
  Notes: the table name is the query name

2.1 File Sink

Writes the output to files. Supported formats: parquet, csv, orc, json, etc.

example

Code Location: org.apache.spark.sql.structured.datasource.example

val fileSink = source.writeStream
  .format("parquet")
  //.format("csv")
  //.format("orc")
 // .format("json")
  .option("path", "data/sink")
  .option("checkpointLocation", "/tmp/temporary-" + UUID.randomUUID.toString)
  .start()

2.2 Console Sink

Writes the output to the console.

example

Code Location: org.apache.spark.sql.structured.datasource.example

    val consoleSink = source.writeStream
      .format("console")
      // Whether to truncate long output (false shows the full values)
      .option("truncate", value = false)
      // Number of rows to display
      .option("numRows", 30)
      .option("checkpointLocation", "/tmp/temporary-" + UUID.randomUUID.toString)
      .start()

2.3 Memory Sink

Writes the result to an in-memory table; the table name must be specified with queryName, and the table can then be queried with SQL.

example

Code Location: org.apache.spark.sql.structured.datasource.example


    val memorySink = source.writeStream
      .format("memory")
      .queryName("memorySinkTable")
      .option("checkpointLocation", "/tmp/temporary-" + UUID.randomUUID.toString)
      .start()


    new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          spark.sql("select * from memorySinkTable").show(false)
          Thread.sleep(1000)
        }
      }
    }).start()
    memorySink.awaitTermination()

2.4 Kafka Sink

Writes the result to Kafka. The DataFrame first needs to be converted into two columns (key, value) or three columns (topic, key, value).

example

Code Location: org.apache.spark.sql.structured.datasource.example

    import org.apache.spark.sql.functions._
    import spark.implicits._
    // The Kafka sink requires string (or binary) columns named exactly "value" and "key",
    // so cast/convert first and alias last.
    val kafkaSink = source.select(
      to_json(struct("*")).as("value"),
      $"timestamp".cast(StringType).as("key"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/temporary-" + UUID.randomUUID.toString)
      .option("topic", "hiacloud-ts-dev")
      .start()

2.5 ForeachBatch Sink (2.4)

Suitable for scenarios where the same batch-style write logic is applied to every micro-batch. The method receives each batch's DataFrame together with its batchId. It is only available after version 2.3 and only supports micro-batch mode.

example

Code Location: org.apache.spark.sql.structured.datasource.example

    val foreachBatchSink = source.writeStream.foreachBatch((batchData: DataFrame, batchId) => {
      batchData.show(false)
    }).start()
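
Because the DataFrame handed to foreachBatch is an ordinary batch DataFrame, the same write logic can simply reuse Spark's built-in batch writers. A sketch of writing every micro-batch to MySQL through the jdbc writer (the URL, table and credentials below are placeholders for illustration):

    val jdbcBatchSink = source.writeStream.foreachBatch { (batchData: DataFrame, batchId: Long) =>
      // Reuse the built-in batch jdbc writer for each micro-batch.
      // Connection details are placeholders for illustration only.
      batchData.write
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/spark-source?useSSL=false")
        .option("dbtable", "foreach_batch_test")
        .option("user", "root")
        .option("password", "******")
        .mode("append")
        .save()
    }.start()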

2.6 Foreach Sink

Foreach processes every single record. You extend ForeachWriter[Row] and implement its open(), process() and close() methods. In open we can acquire a resource connection, for example a MySQL connection; in process we receive one record, handle it and send it to the MySQL connection obtained in open; in close we can release the resource connection. Note that foreach works per partition: open and close are called only once for each partition, while process is called for every record.

example

Code Location: org.apache.spark.sql.structured.datasource.example

    val foreachSink = source.writeStream
        .foreach(new ForeachWriter[Row] {
          override def open(partitionId: Long, version: Long): Boolean = {
            println(s"partitionId=$partitionId,version=$version")
            true
          }

          override def process(value: Row): Unit = {
            println(value)
          }

          override def close(errorOrNull: Throwable): Unit = {
            println("close")
          }
        })
      .start()
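
As described above, open is where a connection can be acquired and close is where it is released. A minimal sketch of a ForeachWriter that writes each record to MySQL over plain JDBC (the URL, credentials, table and column below are placeholders for illustration):

    val mysqlForeachSink = source.writeStream
      .foreach(new ForeachWriter[Row] {
        var conn: java.sql.Connection = _
        var stmt: java.sql.PreparedStatement = _

        override def open(partitionId: Long, version: Long): Boolean = {
          // Acquire one connection per partition; URL and credentials are placeholders.
          conn = java.sql.DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/spark-source?useSSL=false", "root", "******")
          stmt = conn.prepareStatement("INSERT INTO foreach_sink_test(name) VALUES (?)")
          true
        }

        override def process(value: Row): Unit = {
          // Handle one record; assumes the first column is a string field.
          stmt.setString(1, value.getString(0))
          stmt.executeUpdate()
        }

        override def close(errorOrNull: Throwable): Unit = {
          // Release the resources acquired in open.
          if (stmt != null) stmt.close()
          if (conn != null) conn.close()
        }
      })
      .start()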

3 Custom input source

In some scenarios we may need a custom data source. For example, a job might need to consume a KafkaSource while also dynamically loading business data from a cache or over HTTP, or read from some other system entirely; all of these can follow the same conventions. Implementing a custom input source takes the following steps:

Step 1: create a custom Provider class that extends DataSourceRegister and StreamSourceProvider

Step 2: override the shortName method of DataSourceRegister and the createSource and sourceSchema methods of StreamSourceProvider

Step 3: create a custom Source class that extends Source

Step 4: override the schema method of Source to specify the input source's schema

Step 5: override the getOffset method of Source to monitor the stream data

Step 6: override the getBatch method of Source to fetch the data

Step 7: override the stop method of Source to release resources

3.1 Creating the CustomDataSourceProvider class

3.1.1 Extend DataSourceRegister and StreamSourceProvider

To create a custom DataSourceProvider, the class must extend DataSourceRegister and StreamSourceProvider, both located in the org.apache.spark.sql.sources package, as shown below:

class CustomDataSourceProvider extends DataSourceRegister
  with StreamSourceProvider
  with Logging {
      //Override some functions ……
  }

3.1.2 Override the shortName method of DataSourceRegister

This method specifies the descriptive name of the data source, which is used to register the data source with Spark, just like the shortName of Spark's built-in sources: kafka, socket, rate, etc. The method returns a string, as shown below:

  /**
    * Descriptive name of the data source, e.g. kafka, socket
    *
    * @return the shortName string
    */
  override def shortName(): String = "custom"
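
For the short name to actually resolve, i.e. so that .format("custom") finds this provider, Spark looks up DataSourceRegister implementations through Java's ServiceLoader. The provider class therefore also needs to be listed in a service registration file on the classpath, roughly like this:

# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.structured.datasource.custom.CustomDataSourceProvider

Without this file the source can still be used by passing the provider's fully qualified class name to format(), which is what the examples later in this article do.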

3.1.3 Override the sourceSchema method of StreamSourceProvider

This method defines the schema of the data source. It can use the schema passed in by the user, or build one dynamically from the passed-in parameters. The return value is a tuple (shortName, schema), as shown below:

  /**
    * Define the schema of the data source
    *
    * @param sqlContext   Spark SQL context
    * @param schema       schema passed in via the .schema() method
    * @param providerName name of the Provider (package name + class name)
    * @param parameters   parameters passed in via the .option() method
    * @return a tuple (shortName, schema)
    */
  override def sourceSchema(sqlContext: SQLContext,
                            schema: Option[StructType],
                            providerName: String,
                            parameters: Map[String, String]): (String, StructType) = (shortName(),schema.get)

3.1.4 Override the createSource method of StreamSourceProvider

This method instantiates our custom DataSource from the parameters passed in; it is the main entry point to our custom Source.

/**
  * Create the input source
  *
  * @param sqlContext   Spark SQL context
  * @param metadataPath metadata path
  * @param schema       schema passed in via the .schema() method
  * @param providerName name of the Provider (package name + class name)
  * @param parameters   parameters passed in via the .option() method
  * @return the custom source, which must implement the Source interface
  **/

override def createSource(sqlContext: SQLContext,
                          metadataPath: String,
                          schema: Option[StructType],
                          providerName: String,
                          parameters: Map[String, String]): Source = new CustomDataSource(sqlContext,parameters,schema)

3.1.5 Complete code of CustomDataSourceProvider.scala

package org.apache.spark.sql.structured.datasource.custom

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.{Sink, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider, StreamSourceProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

/**
  * @author : shirukai
  * @date : 2019-01-25 17:49
  *       Custom Structured Streaming data source
  *
  *       (1) Extend DataSourceRegister
  *       Override the shortName method to register this component with Spark
  *
  *       (2) Extend StreamSourceProvider
  *       Override the createSource and sourceSchema methods to create the data input source
  *
  *       (3) Extend StreamSinkProvider
  *       Override the createSink method to create the data output sink
  *
  *
  */
class CustomDataSourceProvider extends DataSourceRegister
  with StreamSourceProvider
  with StreamSinkProvider
  with Logging {


  /**
    * Descriptive name of the data source, e.g. kafka, socket
    *
    * @return the shortName string
    */
  override def shortName(): String = "custom"


  /**
    * Define the schema of the data source
    *
    * @param sqlContext   Spark SQL context
    * @param schema       schema passed in via the .schema() method
    * @param providerName name of the Provider (package name + class name)
    * @param parameters   parameters passed in via the .option() method
    * @return a tuple (shortName, schema)
    */
  override def sourceSchema(sqlContext: SQLContext,
                            schema: Option[StructType],
                            providerName: String,
                            parameters: Map[String, String]): (String, StructType) = (shortName(),schema.get)

  /**
    * Create the input source
    *
    * @param sqlContext   Spark SQL context
    * @param metadataPath metadata path
    * @param schema       schema passed in via the .schema() method
    * @param providerName name of the Provider (package name + class name)
    * @param parameters   parameters passed in via the .option() method
    * @return the custom source, which must implement the Source interface
    **/

  override def createSource(sqlContext: SQLContext,
                            metadataPath: String,
                            schema: Option[StructType],
                            providerName: String,
                            parameters: Map[String, String]): Source = new CustomDataSource(sqlContext,parameters,schema)


  /**
    * Create the output sink
    *
    * @param sqlContext       Spark SQL context
    * @param parameters       parameters passed in via the .option() method
    * @param partitionColumns partition column names
    * @param outputMode       output mode
    * @return
    */
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = new CustomDataSink(sqlContext,parameters,outputMode)
}

3.2 Creating the CustomDataSource class

3.2.1 Extend Source to create the CustomDataSource class

To create a custom DataSource, the class must extend Source, located in the org.apache.spark.sql.execution.streaming package, as shown below:

class CustomDataSource(sqlContext: SQLContext,
                       parameters: Map[String, String],
                       schemaOption: Option[StructType]) extends Source
  with Logging {
  //Override some functions ……
}

3.2.2 Override the schema method of Source

Specifies the schema of the data source. It must be consistent with the schema specified in the Provider's sourceSchema method, otherwise an exception is thrown.

  /**
    * Specify the schema of the data source. It must be consistent with the schema
    * specified in the Provider's sourceSchema, otherwise an exception is thrown.
    * Trigger: executed when the data source is created
    *
    * @return schema
    */
  override def schema: StructType = schemaOption.get

3.2.3 Override the getOffset method of Source

Spark polls this method continuously to monitor changes in the stream data; once the data changes, getBatch is triggered to fetch it.

  /**
    * Get the offset, used to monitor changes in the data
    * Trigger: polled continuously
    * Key points:
    * (1) Implementing Offset:
    * As the return type shows, we need to return a standard Option[Offset].
    * We can implement it by extending Offset (the code here uses org.apache.spark.sql.execution.streaming.Offset), which essentially just stores a JSON string.
    *
    * (2) JSON conversion
    * Because an Offset is backed by a JSON string, we need to convert the collection or case class
    * holding our offset into a JSON string.
    * Spark uses the org.json4s.jackson package to convert case classes and collection types
    * (Map, List, Seq, Set, etc.) to and from JSON strings.
    *
    * @return Offset
    */
  override def getOffset: Option[Offset] = ???
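
To make the contract concrete, here is an illustrative sketch (not part of the original code) of an offset that wraps a single Long counter; the helper that asks the external system for its current size is hypothetical:

case class CounterOffset(count: Long) extends Offset {
  // An Offset only has to be able to render itself as a JSON string.
  override def json(): String = s"""{"count":$count}"""
}

override def getOffset: Option[Offset] = {
  // queryExternalSystemForCount() is a hypothetical helper, e.g. a row count query.
  val latest = queryExternalSystemForCount()
  // Returning None tells Spark there is no data to process yet.
  if (latest == 0) None else Some(CounterOffset(latest))
}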

3.2.4 Override the getBatch method of Source

Spark calls this method to fetch the data. It is triggered when getOffset detects that the data has changed: the end offset of the previous trigger is passed in as the start offset of the current batch, and the new offset as the end offset.

  /**
    * Fetch the data
    *
    * @param start the end offset of the previous batch
    * @param end   the new offset obtained from getOffset
    *              Trigger: called when the offset returned by the polled getOffset method changes
    *
    *              Key points:
    *              (1) Creating the DataFrame:
    *              Generate an RDD first, then create the DataFrame from it
    *              RDD creation: sqlContext.sparkContext.parallelize(rows.toSeq)
    *              DataFrame creation: sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
    * @return DataFrame
    */
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

3.2.5 Override the stop method of Source

Used to close or stop any resources and processes that need to be released.

  /**
    * Release resources
    * Close anything that needs to be closed here, e.g. a MySQL database connection
    */
  override def stop(): Unit = ???

3.2.6 Complete code of CustomDataSource.scala

package org.apache.spark.sql.structured.datasource.custom

import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, SQLContext}

/**
  * @author : shirukai
  * @date : 2019-01-25 18:03
  *       Custom data input source: must extend the Source interface
  *       Implementation outline:
  *       (1) Override the schema method to specify the schema of the input source; it must be consistent with the schema specified in the Provider
  *       (2) Override the getOffset method to obtain the data offset; this method is polled continuously
  *       (3) Override the getBatch method to fetch the data; it is triggered after the offset changes
  *       (4) Override the stop method to release resources
  *
  */
class CustomDataSource(sqlContext: SQLContext,
                       parameters: Map[String, String],
                       schemaOption: Option[StructType]) extends Source
  with Logging {

  /**
    * Specify the schema of the data source. It must be consistent with the schema
    * specified in the Provider's sourceSchema, otherwise an exception is thrown.
    * Trigger: executed when the data source is created
    *
    * @return schema
    */
  override def schema: StructType = schemaOption.get

  /**
    * Get the offset, used to monitor changes in the data
    * Trigger: polled continuously
    * Key points:
    * (1) Implementing Offset:
    * As the return type shows, we need to return a standard Option[Offset].
    * We can implement it by extending Offset (the code here uses org.apache.spark.sql.execution.streaming.Offset), which essentially just stores a JSON string.
    *
    * (2) JSON conversion
    * Because an Offset is backed by a JSON string, we need to convert the collection or case class
    * holding our offset into a JSON string.
    * Spark uses the org.json4s.jackson package to convert case classes and collection types
    * (Map, List, Seq, Set, etc.) to and from JSON strings.
    *
    * @return Offset
    */
  override def getOffset: Option[Offset] = ???

  /**
    * Fetch the data
    *
    * @param start the end offset of the previous batch
    * @param end   the new offset obtained from getOffset
    *              Trigger: called when the offset returned by the polled getOffset method changes
    *
    *              Key points:
    *              (1) Creating the DataFrame:
    *              Generate an RDD first, then create the DataFrame from it
    *              RDD creation: sqlContext.sparkContext.parallelize(rows.toSeq)
    *              DataFrame creation: sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
    * @return DataFrame
    */
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  /**
    * Release resources
    * Close anything that needs to be closed here, e.g. a MySQL database connection
    */
  override def stop(): Unit = ???
}

3.3 Using the custom DataSource

A custom DataSource is used exactly like a built-in one: simply pass the class path of our Provider to format(). For example:

    val source = spark
      .readStream
      .format("org.apache.spark.sql.kafka010.CustomSourceProvider")
      .options(options)
      .schema(schema)
      .load()

3.4 Implementing a custom MySQL data source

This example only demonstrates how to implement a custom data source; it is not tied to any real business scenario.

3.4.1 Creating MySQLSourceProvider.scala

package org.apache.spark.sql.structured.datasource

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.{Sink, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider, StreamSourceProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

/**
  * @author : shirukai
  * @date : 2019-01-25 09:10
  *       Custom MySQL data source
  */
class MySQLSourceProvider extends DataSourceRegister
  with StreamSourceProvider
  with StreamSinkProvider
  with Logging {
  /**
    * Descriptive name of the data source, e.g. kafka, socket
    *
    * @return the shortName string
    */
  override def shortName(): String = "mysql"


  /**
    * Define the schema of the data source
    *
    * @param sqlContext   Spark SQL context
    * @param schema       schema passed in via the .schema() method
    * @param providerName name of the Provider (package name + class name)
    * @param parameters   parameters passed in via the .option() method
    * @return a tuple (shortName, schema)
    */
  override def sourceSchema(
                             sqlContext: SQLContext,
                             schema: Option[StructType],
                             providerName: String,
                             parameters: Map[String, String]): (String, StructType) = {
    (providerName, schema.get)
  }

  /**
    * Create the input source
    *
    * @param sqlContext   Spark SQL context
    * @param metadataPath metadata path
    * @param schema       schema passed in via the .schema() method
    * @param providerName name of the Provider (package name + class name)
    * @param parameters   parameters passed in via the .option() method
    * @return the custom source, which must implement the Source interface
    */
  override def createSource(
                             sqlContext: SQLContext,
                             metadataPath: String, schema: Option[StructType],
                             providerName: String, parameters: Map[String, String]): Source = new MySQLSource(sqlContext, parameters, schema)

  /**
    * Create the output sink
    *
    * @param sqlContext       Spark SQL context
    * @param parameters       parameters passed in via the .option() method
    * @param partitionColumns partition column names
    * @param outputMode       output mode
    * @return
    */
  override def createSink(
                           sqlContext: SQLContext,
                           parameters: Map[String, String],
                           partitionColumns: Seq[String], outputMode: OutputMode): Sink = new MySQLSink(sqlContext: SQLContext,parameters, outputMode)
}

3.4.2 Creating MySQLSource.scala

package org.apache.spark.sql.structured.datasource

import java.sql.Connection

import org.apache.spark.executor.InputMetrics
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.json4s.jackson.Serialization
import org.json4s.{Formats, NoTypeHints}


/**
  * @author : shirukai
  * @date : 2019-01-25 09:41
  */
class MySQLSource(sqlContext: SQLContext,
                  options: Map[String, String],
                  schemaOption: Option[StructType]) extends Source with Logging {

  lazy val conn: Connection = C3p0Utils.getDataSource(options).getConnection

  val tableName: String = options("tableName")

  var currentOffset: Map[String, Long] = Map[String, Long](tableName -> 0)

  val maxOffsetPerBatch: Option[Long] = Option(100)

  val inputMetrics = new InputMetrics()

  override def schema: StructType = schemaOption.get

  /**
    * Get the offset
    * Here we monitor changes in the row count of the MySQL table
    * @return Option[Offset]
    */
  override def getOffset: Option[Offset] = {
    val latest = getLatestOffset
    val offsets = maxOffsetPerBatch match {
      case None => MySQLSourceOffset(latest)
      case Some(limit) =>
        MySQLSourceOffset(rateLimit(limit, currentOffset, latest))
    }
    Option(offsets)
  }

  /**
    * Fetch the data
    * @param start the previous offset
    * @param end the latest offset
    * @return df
    */
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {

    var offset: Long = 0
    if (start.isDefined) {
      offset = offset2Map(start.get)(tableName)
    }
    val limit = offset2Map(end)(tableName) - offset
    val sql = s"SELECT * FROM $tableName limit $limit offset $offset"

    val st = conn.prepareStatement(sql)
    val rs = st.executeQuery()
    val rows: Iterator[InternalRow] = JdbcUtils.resultSetToSparkInternalRows(rs, schemaOption.get, inputMetrics) // convert the JDBC ResultSet into Spark InternalRows
    val rdd = sqlContext.sparkContext.parallelize(rows.toSeq)

    currentOffset = offset2Map(end)

    sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
  }

  override def stop(): Unit = {
    conn.close()
  }

  def rateLimit(limit: Long, currentOffset: Map[String, Long], latestOffset: Map[String, Long]): Map[String, Long] = {
    val co = currentOffset(tableName)
    val lo = latestOffset(tableName)
    if (co + limit > lo) {
      Map[String, Long](tableName -> lo)
    } else {
      Map[String, Long](tableName -> (co + limit))
    }
  }

  // Get the latest row count
  def getLatestOffset: Map[String, Long] = {
    var offset: Long = 0
    val sql = s"SELECT COUNT(1) FROM $tableName"
    val st = conn.prepareStatement(sql)
    val rs = st.executeQuery()
    while (rs.next()) {
      offset = rs.getLong(1)
    }
    Map[String, Long](tableName -> offset)
  }

  def offset2Map(offset: Offset): Map[String, Long] = {
    implicit val formats: AnyRef with Formats = Serialization.formats(NoTypeHints)
    Serialization.read[Map[String, Long]](offset.json())
  }
}

case class MySQLSourceOffset(offset: Map[String, Long]) extends Offset {
  implicit val formats: AnyRef with Formats = Serialization.formats(NoTypeHints)

  override def json(): String = Serialization.write(offset)
}

3.4.3 Testing MySQLSource

package org.apache.spark.sql.structured.datasource

import java.util.UUID

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

/**
  * @author : shirukai
  * @date : 2019-01-25 15:12
  */
object MySQLSourceTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
    val schema = StructType(List(
      StructField("name", StringType),
      StructField("creatTime", TimestampType),
      StructField("modifyTime", TimestampType)
    )
    )
    val options = Map[String, String](
      "driverClass" -> "com.mysql.cj.jdbc.Driver",
      "jdbcUrl" -> "jdbc:mysql://localhost:3306/spark-source?useSSL=false&characterEncoding=utf-8",
      "user" -> "root",
      "password" -> "hollysys",
      "tableName" -> "model")
    val source = spark
      .readStream
      .format("org.apache.spark.sql.structured.datasource.MySQLSourceProvider")
      .options(options)
      .schema(schema)
      .load()

    import org.apache.spark.sql.functions._
    val query = source.writeStream.format("console")
      // Whether to truncate long output (false shows the full values)
      .option("truncate", value = false)
      // Number of rows to display
      .option("numRows", 30)
      .option("checkpointLocation", "/tmp/temporary-" + UUID.randomUUID.toString)
      .start()
    query.awaitTermination()
  }
}

4 Custom output sink

Compared with custom input sources, custom output sinks seem to be needed more often in practice, for example writing data to a relational database, to HBase, or to Redis. In fact, the foreach method and, since version 2.4, the foreachBatch method already cover the vast majority of scenarios; with them you can write data to almost anywhere. For a more elegant implementation, however, we can follow the Spark SQL Sink conventions and implement a custom Sink. Implementing a custom Sink takes the following four steps:

Step 1: create a custom SinkProvider class that extends DataSourceRegister and StreamSinkProvider

Step 2: override the shortName method of DataSourceRegister and the createSink method of StreamSinkProvider

Step 3: create a custom Sink class that extends Sink

Step 4: override the addBatch method of Sink

4.1 Modifying the CustomDataSourceProvider class

4.1.1 Additionally extend StreamSinkProvider

Building on the custom input source created above, additionally extend StreamSinkProvider, as shown below:

class CustomDataSourceProvider extends DataSourceRegister
  with StreamSourceProvider
  with StreamSinkProvider
  with Logging {
      //Override some functions ……
  }

4.1.2 Override the createSink method of StreamSinkProvider

This method instantiates our custom DataSink from the parameters passed in; it is the main entry point to our custom Sink.

  /**
    * Create the output sink
    *
    * @param sqlContext       Spark SQL context
    * @param parameters       parameters passed in via the .option() method
    * @param partitionColumns partition column names
    * @param outputMode       output mode
    * @return
    */
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = new CustomDataSink(sqlContext,parameters,outputMode)

4.2 Creating the CustomDataSink class

4.2.1 Extend Sink to create the CustomDataSink class

To create a custom DataSink, the class must extend Sink, located in the org.apache.spark.sql.execution.streaming package, as shown below:

class CustomDataSink(sqlContext: SQLContext,
                     parameters: Map[String, String],
                     outputMode: OutputMode) extends Sink with Logging {
    // Override some functions
}

4.2.2 Override the addBatch method of Sink

This method is triggered whenever a computation happens; it receives a batchId and a DataFrame. Once we have the DataFrame, there are three ways to write it out: first, through one of Spark SQL's built-in sinks such as the JSON, CSV, Text, Parquet or JDBC data sources; second, through the DataFrame's foreachPartition; third, through a custom Spark SQL output data source.

/**
  * Add a batch, i.e. write the data out
  *
  * @param batchId batchId
  * @param data    DataFrame
  *                Trigger: called whenever a computation happens, receiving the DataFrame to be output
  *                Implementation summary:
  *                1. Ways to write the data:
  *                (1) Write out through a built-in Spark SQL data source
  *                Once we have the DataFrame, we can write it out through a built-in data source,
  *                e.g. the JSON, CSV, Text, Parquet or JDBC data sources.
  *                (2) Write out through a custom Spark SQL data source
  *                (3) Write out through foreachPartition
  */
override def addBatch(batchId: Long, data: DataFrame): Unit = ???
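
Strategy (3) from the comment above, writing through foreachPartition, is not shown elsewhere in this article. A rough sketch, with the connection details and SQL as placeholders, could look like this (note that the streaming DataFrame is accessed through queryExecution.toRdd, for the reason explained in the note below):

override def addBatch(batchId: Long, data: DataFrame): Unit = {
  // data is a streaming DataFrame, so go through queryExecution.toRdd
  // instead of calling foreachPartition on the DataFrame directly.
  data.queryExecution.toRdd.foreachPartition { iter =>
    // Placeholders: URL, credentials, table and column are for illustration only.
    val conn = java.sql.DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/spark-source?useSSL=false", "root", "******")
    val stmt = conn.prepareStatement("INSERT INTO custom_sink_test(name) VALUES (?)")
    try {
      iter.foreach { internalRow =>
        // Assumes the first column of the schema is a string field.
        stmt.setString(1, internalRow.getUTF8String(0).toString)
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}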

Note

When using the first approach, be aware that the DataFrame we get here is a streaming DataFrame (isStreaming = true). Looking at KafkaSink, shown below, it first runs the query through DataFrame.queryExecution and then, inside write, converts it to an RDD and writes via the RDD's foreachPartition. Following the same idea, we can take this RDD and the schema and rebuild a batch DataFrame with sqlContext.internalCreateDataFrame(rdd, data.schema); this is what MySQLSink below does.

override def addBatch(batchId: Long, data: DataFrame): Unit = {
  if (batchId <= latestBatchId) {
    logInfo(s"Skipping already committed batch $batchId")
  } else {
    KafkaWriter.write(sqlContext.sparkSession,
      data.queryExecution, executorKafkaParams, topic)
    latestBatchId = batchId
  }
}

  def write(
      sparkSession: SparkSession,
      queryExecution: QueryExecution,
      kafkaParameters: ju.Map[String, Object],
      topic: Option[String] = None): Unit = {
    val schema = queryExecution.analyzed.output
    validateQuery(schema, kafkaParameters, topic)
    queryExecution.toRdd.foreachPartition { iter =>
      val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
      Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
        finallyBlock = writeTask.close())
    }
  }

4.3 Using the custom DataSink

A custom DataSink is used the same way as a custom DataSource: specify the class path of the Provider in format().

val query = source.groupBy("creatTime").agg(collect_list("name")).writeStream
  .outputMode("update")
  .format("org.apache.spark.sql.structured.datasource.custom.CustomDataSourceProvider")
  .options(options)
  .start()
query.awaitTermination()

4.4 Implementing a custom MySQL output sink

4.4.1 Modifying MySQLSourceProvider.scala

When implementing the custom MySQL input source above we already created the MySQLSourceProvider class. On that basis we additionally extend StreamSinkProvider and override the createSink method, as follows:

package org.apache.spark.sql.structured.datasource

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.{Sink, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider, StreamSourceProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

/**
  * @author : shirukai
  * @date : 2019-01-25 09:10
  *       Custom MySQL data source
  */
class MySQLSourceProvider extends DataSourceRegister
  with StreamSourceProvider
  with StreamSinkProvider
  with Logging {
      
  // ... the methods of the custom input source are omitted here

  /**
    * Create the output sink
    *
    * @param sqlContext       Spark SQL context
    * @param parameters       parameters passed in via the .option() method
    * @param partitionColumns partition column names
    * @param outputMode       output mode
    * @return
    */
  override def createSink(
                           sqlContext: SQLContext,
                           parameters: Map[String, String],
                           partitionColumns: Seq[String], outputMode: OutputMode): Sink = new MySQLSink(sqlContext: SQLContext,parameters, outputMode)
}

4.4.2 Creating MySQLSink.scala

package org.apache.spark.sql.structured.datasource

import org.apache.spark.internal.Logging
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.streaming.OutputMode

/**
  * @author : shirukai
  * @date : 2019-01-25 17:35
  */
class MySQLSink(sqlContext: SQLContext,parameters: Map[String, String], outputMode: OutputMode) extends Sink with Logging {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val query = data.queryExecution
    val rdd = query.toRdd
    val df = sqlContext.internalCreateDataFrame(rdd, data.schema)
    df.show(false)
    df.write.format("jdbc").options(parameters).mode(SaveMode.Append).save()
  }
}

4.4.3 Testing MySQLSink

package org.apache.spark.sql.structured.datasource

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

/**
  * @author : shirukai
  * @date : 2019-01-29 09:57
  *       Test the custom MySQLSink
  */
object MySQLSinkTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
    val schema = StructType(List(
      StructField("name", StringType),
      StructField("creatTime", TimestampType),
      StructField("modifyTime", TimestampType)
    )
    )
    val options = Map[String, String](
      "driverClass" -> "com.mysql.cj.jdbc.Driver",
      "jdbcUrl" -> "jdbc:mysql://localhost:3306/spark-source?useSSL=false&characterEncoding=utf-8",
      "user" -> "root",
      "password" -> "hollysys",
      "tableName" -> "model")
    val source = spark
      .readStream
      .format("org.apache.spark.sql.structured.datasource.MySQLSourceProvider")
      .options(options)
      .schema(schema)
      .load()

    import org.apache.spark.sql.functions._
    val query = source.groupBy("creatTime").agg(collect_list("name").cast(StringType).as("names")).writeStream
      .outputMode("update")
      .format("org.apache.spark.sql.structured.datasource.MySQLSourceProvider")
      .option("checkpointLocation", "/tmp/MySQLSourceProvider11")
      .option("user","root")
      .option("password","hollysys")
      .option("dbtable","test")
      .option("url","jdbc:mysql://localhost:3306/spark-source?useSSL=false&characterEncoding=utf-8")
      .start()

    query.awaitTermination()
  }
}

5 Summary

From the notes above and the official documentation we can see that Structured Streaming supports several input sources: File Source, Socket Source, Rate Source and Kafka Source. In practice we usually use KafkaSource and FileSource, while SocketSource and RateSource are mainly used for testing. For input sources there is no shortcut: custom behaviour can only be achieved by implementing Source. For output, the foreach and foreachBatch methods provided by Structured Streaming already cover most scenarios, so there is usually no need to implement a custom Sink. Custom input sources for Spark SQL and custom data sources for Spark Streaming will be written up separately later.
