Structured Streaming Programming Guide (2.3.0)

Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data keeps arriving. You can use the Dataset/DataFrame API in Scala, Java, Python, or R to express streaming aggregations, event-time windows, stream-to-batch joins, and so on. The computation is executed on the same optimized Spark SQL engine. Finally, the system guarantees end-to-end exactly-once fault tolerance through checkpointing and write-ahead logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which treats the data stream as a series of small batch jobs and achieves end-to-end latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called continuous processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your query, you can choose the mode based on your application's requirements.
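For reference, choosing the continuous processing mode only requires specifying a continuous trigger when the query is started; the Dataset/DataFrame operations themselves are unchanged. Below is a minimal, self-contained sketch (the socket source, console sink, and trigger interval are illustrative choices; note that continuous mode supports only map-like operations such as projections and selections, not aggregations):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

SparkSession spark = SparkSession.builder().appName("ContinuousSketch").getOrCreate();

// Read lines from a socket source (host and port are placeholders)
Dataset<Row> lines = spark.readStream()
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load();

// Same query definition as in micro-batch mode; only the trigger changes
StreamingQuery query = lines.writeStream()
  .format("console")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start();

query.awaitTermination();
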
In this guide, we will walk you through the programming model and the APIs. We will mainly use the default micro-batch processing mode to explain the concepts, and then discuss the continuous processing mode. First, let's start with a simple example of a Structured Streaming query: a streaming word count.
Quick Example
Suppose you want to maintain a running word count over text data received from a data server listening on a TCP socket. Let's see how you can express this using Structured Streaming. The full code is available in Scala/Java/Python/R; the snippets below use the Java API. If you have downloaded Spark, you can run the example directly. In any case, let's walk through the example step by step to understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionality related to Spark.

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;
import java.util.Arrays;
import java.util.Iterator;
SparkSession spark = SparkSession
  .builder()
  .appName("JavaStructuredNetworkWordCount")
  .getOrCreate();

Next, create a streaming DataFrame that represents text data received from a server listening on localhost:9999, and then transform the DataFrame to compute the running word counts.

// Create DataFrame representing the stream of input lines from connection to localhost:9999
Dataset<Row> lines = spark.readStream()
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load();

// Split the lines into words
Dataset<String> words = lines.as(Encoders.STRING())
  .flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" ")).iterator(), Encoders.STRING());

// Generate running word count
Dataset<Row> wordCounts = words.groupBy("value").count();
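
The code above only defines the transformation; to actually start receiving data and computing the counts, the query still needs an output sink and has to be started. A minimal sketch using the console sink (the sink and the "complete" output mode are illustrative choices) looks like this:

// Start running the query that prints the running counts to the console
StreamingQuery query = wordCounts.writeStream()
  .outputMode("complete")
  .format("console")
  .start();

query.awaitTermination();

Once the query is running, you can feed text to the socket, for example with Netcat (nc -lk 9999), and the updated counts will be printed after each micro-batch.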
