Understanding and Getting Started with Spark Structured Streaming

Foreword: I have been busy with projects lately and haven't written a blog post for a couple of days, so I'm taking some time to write one today. It also happens that I've been working with Spark recently, so this post can double as my notes. Learning Structured Streaming has been a bumpy road: it is a fairly new addition to Spark, there are basically no tutorials online, and the official documentation only gives a WordCount example. Whenever I ran into problems I ended up asking on StackOverflow, and the people there are friendly and genuinely talented. One more recommendation: use the latest version of Spark you can. This feature is still maturing and is updated constantly, and many methods and functions are only available in newer releases. With that, let's get to today's topic.

1. What is Structured Streaming

  • Explanation from the official website:
    Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express streaming computations the same way you would express batch computations on static data. As streaming data continues to arrive, the Spark SQL engine takes care of running the query incrementally and continuously and updating the final result. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, and so on. The computations are executed on the same optimized Spark SQL engine. Finally, the system guarantees end-to-end fault tolerance through checkpointing and write-ahead logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming itself.
    Internally, by default, Structured Streaming queries are processed with a micro-batch engine, which treats the data stream as a series of small batch jobs, achieving end-to-end latencies as low as 100 milliseconds together with exactly-once fault-tolerance guarantees. However, starting with Spark 2.3 there is a new low-latency processing mode called "continuous processing", which can bring end-to-end latencies down to as low as 1 millisecond. You can choose the mode based on your application's requirements without changing the Dataset/DataFrame operations in your query (a small trigger sketch follows at the end of this section).
  • My understanding:
    The first word, 'Structured', tells us that the data is structured. What is structured data? When data is stored under a fixed schema, the way it is in a relational database, we call it structured data. Unstructured data, by contrast, is something like browsing history or raw server logs, whose format follows no particular schema. After all that, you can simply think of structured data as a table.
    Now for the second word, 'Streaming'. Anyone who has learned Java knows that reading and writing files there is also implemented on top of streams, the so-called IO streams. A stream is like water in a river, always flowing, or like an assembly line in a factory: once a product is started it keeps moving along the line, each worker performs one fixed step on it, and after one worker finishes their step the product moves on to the next worker, until all steps are done and the product is complete. That is the idea of a stream.
    Back to Structured Streaming: the 'product' moving through our 'production workshop' is now structured data, that is, a table.
    Structured data is the object we process, and what flows in each time is a table; that is the core of Structured Streaming. Another key difference from other stream-processing systems is that Structured Streaming supports event-time windows: it can gather the data from a period of time into a table, hand it to you, and let you operate on that table.
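  • To make the micro-batch vs. continuous distinction mentioned above concrete, here is a minimal sketch of how the processing mode is chosen through the trigger on the write side. The readStream/writeStream/trigger calls are standard API; the rate source, console sink and interval values are just illustrative choices of mine:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TriggerModes").getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Default micro-batch engine: one small batch per trigger interval
micro_batch_query = stream_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .trigger(processingTime="2 seconds") \
    .start()

# Continuous processing (Spark 2.3+): lower latency, but fewer supported operations;
# the argument is a checkpoint interval, not a batch interval
# continuous_query = stream_df.writeStream \
#     .outputMode("append") \
#     .format("console") \
#     .trigger(continuous="1 second") \
#     .start()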

2. Word Count: a getting-started example

  • As the saying goes, talk is cheap, show me the code. So let's start with the simplest program, Word Count, to get started with Structured Streaming programming.
  • The entry point for Structured Streaming is simple and almost identical to Spark SQL.
  • First, import the necessary packages and initialize a SparkSession object.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("WordCountTest") \
    .getOrCreate()
  • Create a DataFrame. This part differs slightly from Spark SQL: here we read the data from a specified streaming source, which can be Kafka, a socket, files, and so on.
    We will use the socket source here, because it is the simplest and the most convenient for testing.
# Read the data as a stream from a socket source
source_df = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
# The returned DataFrame has the following structure:
'''
+------------------------+
|         value          |
+------------------------+
|      socket data       |
+------------------------+
'''
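  • Before going further, it can be reassuring to confirm that what came back really is a streaming DataFrame; isStreaming and printSchema are standard DataFrame members, so this quick check (my own addition) is harmless:
# Quick sanity check on the streaming DataFrame created above
print(source_df.isStreaming)   # True: this DataFrame is backed by a streaming source
source_df.printSchema()        # root |-- value: string (nullable = true)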
  • With this DataFrame (table) in hand, we can query it with the same methods we would use on a static DataFrame in Spark SQL.
# First use the split function to split each line of data
'''
+--------------------+							       +-------------------------+
|       value        |   split(source_df.value,' ')	   |         value           |
+--------------------+  --------------------------->   +-------------------------+
|    Bob Nick Nina   |								   |  ['Bob','Nick','Nina']  |
+--------------------+								   +-------------------------+

 			     +-----------+
				 |   value   |
				 +-----------+
				 |    Bob    |
  explode() 	 +-----------+
------------->	 |    Nick   |
				 +-----------+
				 |    Nina   |
				 +-----------+		
Finally, rename the value column to word
'''
words = source_df.select(
    explode(
        split(source_df.value, " ")
    ).alias("word")
)
# Group by the word column, then use the count aggregate function to count the occurrences of each word
wordCounts = words.groupBy("word").count()
'''
+--------+-------+
|  word  | count |
+--------+-------+
|   Bob  |   1   |
+--------+-------+
|  Nick  |   1   |
+--------+-------+
|  Nina  |   1   |
+--------+-------+
'''
  • Then we need to specify an output sink for the query above, that is, assemble everything into a complete streaming query.
# Specify the write stream: the output mode is complete mode (explained later), and the output destination is the console
query_task = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
  • Finally, and most importantly: start() launches the query in the background, but without the following call the driver program would simply exit. It blocks until query_task terminates, keeping the query alive.
# The program will block here and will not finish until query_task terminates
query_task.awaitTermination()
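  • A variation while you experiment (my own addition, using standard StreamingQuery members): awaitTermination also accepts a timeout in seconds and returns whether the query has stopped, so you can poll and inspect the query's status instead of blocking forever.
# Poll every 10 seconds instead of blocking indefinitely; status and lastProgress
# are useful for debugging
while not query_task.awaitTermination(timeout=10):
    print(query_task.status)        # e.g. whether a trigger is currently active
    print(query_task.lastProgress)  # metrics for the most recent micro-batch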
  • The Word Count program is now written, but don't rush to run it: we first have to prepare the socket data. That is very easy. You can write a small server yourself with Python's socket module (a rough sketch follows below), but an even easier way is to test with the netcat command.
  • Execute the command in the terminal:
nc -lk 9999
# Do not close this terminal yet, we will use it in a moment
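  • If you prefer not to use netcat, here is a rough stand-in built only on Python's standard socket module (my own sketch): it listens on localhost:9999, accepts a single client, and forwards every line you type.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", 9999))
    server.listen(1)
    conn, _ = server.accept()          # the Spark socket source connects here
    with conn:
        while True:
            line = input()             # type a sentence and press Enter
            conn.sendall((line + "\n").encode("utf-8"))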
  • Run the Word Count program
  • Then type a few sentences in the terminal where the netcat command is running:
I love the world
I love life
Hello world
  • You will then see something like the following in the console where the Word Count program is running:
------------------------------------
------------------------------------
batch 1
+----------------+-----------------+
|       word     |      count      |
+----------------+-----------------+
|       I        |        1        |
+----------------+-----------------+
|      love      |        1        |
+----------------+-----------------+
|      the       |        1        |
+----------------+-----------------+
|      world     |        1        |
+----------------+-----------------+

------------------------------------
------------------------------------
batch 2
+----------------+-----------------+
|       word     |      count      |
+----------------+-----------------+
|       I        |        2        |
+----------------+-----------------+
|      love      |        2        |
+----------------+-----------------+
|      the       |        1        |
+----------------+-----------------+
|      world     |        1        |
+----------------+-----------------+
|      life      |        1        |
+----------------+-----------------+

------------------------------------
------------------------------------
batch 3
+----------------+-----------------+
|       word     |      count      |
+----------------+-----------------+
|       I        |        2        |
+----------------+-----------------+
|      love      |        2        |
+----------------+-----------------+
|      the       |        1        |
+----------------+-----------------+
|      world     |        2        |
+----------------+-----------------+
|      life      |        1        |
+----------------+-----------------+
|      Hello     |        1        |
+----------------+-----------------+

3. Programming Model

  • Basic concept:
    The core idea of Structured Streaming is to treat the input data stream as an unbounded, continuously growing table: the number of rows is not fixed, the table grows over time, and data that arrives later is appended to the end of the table. We can then run the same kind of queries on this table as on a static table, and Spark executes them as incremental queries over the growing table.
    The schematic diagram is as follows:
    [Figure: the data stream viewed as an unbounded input table]

Structured Streaming in fact uses a trigger to slice time into intervals. The data that arrives between the previous trigger and the current one is queried, merged with the history by appending it as new rows to the end of the table, and the merged result table is then output (note that this description applies to Complete mode).
If, in Complete mode, the trigger is set to 1 second, the way it works can be represented by the following diagram:
[Figure: incremental execution with a 1-second trigger in Complete mode]

  • Three output modes:
    Remember the word count program we ran earlier: when building query_task we specified Complete mode, the full mode. In that mode every trigger outputs the full result table, history included. Is that always enough? Of course not, which is why there are two other modes. Here is an overview of the three (a small code sketch contrasting them follows at the end of this section):
    (1) complete (complete mode) - the entire updated result table is written to external storage. It is up to the storage connector to decide how to handle writing the whole table.
    (2) append (append mode) - only the new rows appended to the result table since the last trigger are written to external storage. This only applies to queries where existing rows in the result table are not expected to change.
    (3) update (update mode) - only the rows updated in the result table since the last trigger are written to external storage (available since Spark 2.1.1). Note that this differs from complete mode in that it only outputs rows that have changed since the last trigger. If the query contains no aggregation, it is equivalent to append mode.

  • Complete mode explained:
    Since our earlier word count program runs in complete mode, it should be relatively easy to understand.
    [Figure: complete-mode execution of the word count program]

The figure shows how the word count program runs. The first DataFrame (source_df), read from the socket stream, is called the input table; the DataFrame we obtain through grouping and aggregation (wordCounts) is called the result table. Before the query is started, wordCounts is just a query defined on top of source_df and nothing changes; at that point it is effectively a static DataFrame. Once the query runs and new data arrives, Spark performs an incremental query on it: it combines the previously running counts with the new data to compute the updated counts. Note that Spark reads the latest available data from the streaming source, processes it incrementally to update the result, and then discards the source data. It keeps only the minimal intermediate state required to update the result (for example, the intermediate counts in the example above).
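  • To make the three modes concrete, here is a small sketch of my own that reuses the wordCounts DataFrame from the Word Count example; only the outputMode string changes, and whether a mode is accepted depends on the query:
# Complete mode: every trigger outputs the whole result table (what we ran above)
complete_query = wordCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Update mode: every trigger outputs only the rows whose counts changed
# update_query = wordCounts.writeStream \
#     .outputMode("update") \
#     .format("console") \
#     .start()

# Append mode would be rejected here (AnalysisException), because the aggregated
# rows keep changing; append is meant for queries whose result rows never change.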

4. Supported input sources

Spark Structured Streaming has four built-in input sources:

  • File Source:
    Reads files written to a directory as a data stream. The supported file formats are text, csv, json, orc and parquet. See the documentation of the DataStreamReader interface for the up-to-date list and the options supported by each file format. Note that files must be placed atomically into the given directory, which in most file systems can be achieved with a file move operation.
  • Kafka Source:
    Reads data from Kafka. Kafka versions 0.10.0 and higher are supported.
  • Socket Source:
    Generally used only for testing, because it does not provide end-to-end fault-tolerance guarantees.
    Reads UTF-8 text data from a socket connection; the listening server socket runs on the driver.
  • Rate Source:
    Generates data at a specified number of rows per second, where each output row contains a timestamp and a value: timestamp is a Timestamp holding the time the row was generated, and value is a Long holding the row count, starting from 0 in the first row. This source is intended for testing and benchmarking. (A short read-side sketch of the file and Kafka sources follows this list.)
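  • As a sketch of how two of these sources might be wired up (the directory, schema, broker address and topic name below are placeholders of my own; the Kafka source additionally needs the spark-sql-kafka package on the classpath):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("SourceSketch").getOrCreate()

# File source: watch a directory for new CSV files; a streaming file source
# requires the schema to be declared up front.
csv_schema = StructType([
    StructField("name", StringType()),
    StructField("city", StringType()),
])
file_stream = spark.readStream \
    .schema(csv_schema) \
    .option("maxFilesPerTrigger", 10) \
    .csv("/tmp/incoming_csv")

# Kafka source: broker address and topic are placeholders.
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_topic") \
    .load()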

The following are some configuration parameters of various input sources:

File Source (fault tolerant: yes)
  • path: path of the input directory, common to all file formats (required)
  • maxFilesPerTrigger: maximum number of new files to consider per trigger (default: no max)
  • latestFirst: whether to process the newest files first, useful when there is a large backlog of files (default: false)
  • fileNameOnly: whether to check for new files by file name only instead of the full path (default: false). When set to true, the following are treated as the same file, because their file name "dataset.txt" is identical:
    "file:///dataset.txt"
    "s3://a/dataset.txt"
    "s3n://a/b/dataset.txt"
    "s3a://a/b/c/dataset.txt"
Socket Source (fault tolerant: no)
  • host: the host to connect to (required)
  • port: the port to connect to (required)
Kafka Source (fault tolerant: yes)
  • kafka.bootstrap.servers: the address of the Kafka cluster (required)
  • subscribe: the topic name (required)
Rate Source (fault tolerant: yes)
  • rowsPerSecond (e.g. 100, default: 1): how many rows to generate per second.
  • rampUpTime (e.g. 5s, default: 0s): how long to ramp up before the generation rate reaches rowsPerSecond. Granularities finer than one second are truncated to integer seconds.
  • numPartitions (e.g. 10, default: Spark's default parallelism): the number of partitions for the generated rows.
  The source does its best to reach rowsPerSecond, but the query may be limited by resources; numPartitions can be tuned to help reach the desired rate.
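  • A short sketch of the rate source using the options above (assuming a SparkSession named spark as in the earlier examples; the numbers are arbitrary):
# Generate 100 rows per second across 4 partitions after a 5-second ramp-up
rate_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 100) \
    .option("rampUpTime", "5s") \
    .option("numPartitions", 4) \
    .load()

# Fixed schema: timestamp (Timestamp) and value (Long), counting up from 0
rate_query = rate_stream.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()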

5. Supported sinks

Spark Structured Streaming supports the following six sinks (receivers):

File Sink
  • Supported output modes: append
  • Parameters: path, the output directory. Supports writing to partitioned tables and can be partitioned by time.
  • Fault tolerance: yes (exactly-once)
Kafka Sink
  • Supported output modes: append, update, complete
  • Parameters: similar to the Kafka source above
  • Fault tolerance: yes (at least once)
Foreach Sink
  • Supported output modes: append, update, complete
  • Parameters: a custom class is passed in, and its process() method is called with each row of the result table
  • Fault tolerance: yes (at least once)
ForeachBatch Sink
  • Supported output modes: append, update, complete
  • Parameters: similar to the previous one, except that a function is passed in; during execution Spark calls this function itself and passes in the result table as a DataFrame
  • Fault tolerance: depends on the implementation
Console Sink
  • Supported output modes: append, update, complete
  • Parameters: this is the sink we used earlier; it prints the data to the console.
    numRows: the maximum number of rows displayed on the console (default: 20)
    truncate: whether to truncate long output (default: true)
  • Fault tolerance: no
Memory Sink
  • Supported output modes: append, complete
  • Parameters: stores the result table in memory as an in-memory table. In complete mode, a restarted query recreates the full table.
  • Fault tolerance: no
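  • As an example of the foreachBatch sink (a sketch of my own that reuses wordCounts from the Word Count example; the output path is a placeholder): Spark hands the supplied function each result table as an ordinary batch DataFrame plus a batch id, so any existing batch writer can be reused.
def write_batch(batch_df, batch_id):
    # Each micro-batch arrives as a normal DataFrame; write it wherever you like
    batch_df.write.mode("append").parquet("/tmp/wordcount_batches")

foreach_batch_query = wordCounts.writeStream \
    .outputMode("update") \
    .foreachBatch(write_batch) \
    .start()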
