Spark 2.0: receiving data over a socket and processing it

Suppose you want to listen on a TCP socket of a data server to receive a continuous stream of data, and to count the words it contains in real time.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object SocketComplete {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // First, import the necessary classes and create a SparkSession that runs
    // locally; it is the entry point that connects the program to Spark.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("SocketComplete")
      .getOrCreate()
    import spark.implicits._

    // The source is a socket: host 192.168.0.56 (the slave6 node), port 8008.
    val lines: DataFrame = spark.readStream
      .format("socket")
      .option("host", "192.168.0.56")
      .option("port", 8008)
      .load()

    // The lines DataFrame represents an unbounded table of streaming data. It
    // contains a single string column named "value", and each record in the
    // stream becomes a row of the table. Note that no data has been received
    // yet: this only defines the transformation, it does not start it.
    // lines.as[String] converts the DataFrame to a Dataset[String]
    // (in Spark 2.0 the DataFrame and Dataset APIs are unified, and DataFrame
    // is just a special case: DataFrame = Dataset[Row]), so flatMap can be
    // applied to split each line into a collection of words.
    val words: Dataset[String] = lines.as[String].flatMap(_.split(" "))

    // Group the words with groupBy("value") and count them. wordCounts is a
    // streaming DataFrame representing the running word counts of the
    // continuous stream of text data.
    val wordCounts: DataFrame = words.groupBy("value").count()

    // Print the full result set to the console. A sink has three output modes
    // (complete, append and update, introduced in detail below), selected with
    // outputMode(...); "complete" prints all results every time the result set
    // is updated. start() launches the streaming computation.
    val query: StreamingQuery = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
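
To try the program, something must be serving text on 192.168.0.56:8008; running `nc -lk 8008` on that host is the usual quick way. As a self-contained alternative, here is a minimal sketch of a test data server in Scala (the object name TestDataServer and the sample lines are illustrative assumptions, not part of the original example):

import java.io.PrintWriter
import java.net.ServerSocket

// A tiny test data server: accepts one client (the streaming query's socket
// source) on port 8008 and writes a line of words every second, so the word
// count above has data to process.
object TestDataServer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(8008)
    println("Waiting for the socket source to connect on port 8008 ...")
    val socket = server.accept()
    val out = new PrintWriter(socket.getOutputStream, true)
    val sample = Seq("apache spark", "spark streaming", "structured streaming")
    while (true) {
      sample.foreach { line =>
        out.println(line) // autoFlush = true pushes each line immediately
        Thread.sleep(1000)
      }
    }
  }
}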

The output can be written to external storage in three different modes (a short sketch in code follows the list):

1: Complete Mode - The entire updated result table is written to external storage. How the write of the whole table is performed is left to the connector of the external storage system.

2: Append Mode - When a trigger fires, only the rows newly appended to the Result Table since the last trigger are written to external storage. This mode only applies to queries whose existing rows in the result table never change; if existing rows can be updated, it is not suitable.

3: Update Mode - When a trigger fires, only the rows of the Result Table that were updated since the last trigger are written to external storage (not yet available in Spark 2.0). The difference from Complete Mode is that rows which were not updated are not written out.
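
As a minimal sketch (reusing the wordCounts and lines values defined in the example above), the mode is chosen with outputMode(...) when the query is started; which modes a given query supports depends on the query itself:

// Complete mode: the whole result table is re-emitted on every trigger.
// Valid for aggregation queries such as wordCounts.
val completeQuery: StreamingQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append mode: only rows added since the last trigger are emitted. Valid for
// queries whose result rows never change once produced, such as the raw
// lines stream; Spark rejects it for an aggregation like wordCounts, whose
// counts are updated in place.
val appendQuery: StreamingQuery = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()

// Update mode (outputMode("update")) is not available in Spark 2.0, as noted
// above; in later Spark versions it emits only the rows that changed since
// the last trigger.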
