[Spark] (task7) Introduction to PySpark Streaming

1. Introduction to Spark Streaming

[Figure: Spark Streaming receives input streams (Kafka, HDFS, TCP sockets, etc.) and pushes processed results to HDFS, databases, or dashboards]
Spark Streaming can receive real-time input data streams (such as Kafka, HDFS, or TCP sockets, as shown in the figure above) and process them with high throughput and fault tolerance, then push the processed data to the HDFS file system, databases, or dashboards. Machine learning and graph processing algorithms can also be applied within Spark Streaming. The general processing flow of Spark Streaming is as follows:
[Figure: the input data stream is divided into small batches, which the Spark engine processes to produce batches of results]
For real-time computation over streaming data, such as in a transaction system, there are three mainstream stream computing tools:

  • Storm has the lowest latency, generally a few milliseconds to tens of milliseconds, but its throughput is low (roughly hundreds of thousands of events per second) and its construction cost is high.

  • Flink is currently the main stream computing tool used by Chinese Internet companies. Its latency is generally tens to hundreds of milliseconds, its throughput is very high (up to hundreds of millions of events per second), and its construction cost is low.

  • Spark supports stream computing through Spark Streaming or Spark Structured Streaming.

    • However, Spark's stream computing divides the stream into mini-batches by time before processing, so the latency is generally around 1 second, while the throughput is comparable to Flink's.
    • Spark Structured Streaming now also supports a Continuous Processing mode, in which computation is performed as soon as data arrives, but it is still experimental and not particularly mature.

2. The Difference between Spark Streaming and Structured Streaming

2.1 Streaming and Batch Computing

Batch computing (batch processing) handles offline data. Each run processes a large volume of data, and the processing is relatively slow.

Stream computing processes data generated online in real time. Each run handles a small amount of data, but the processing is faster.
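
As a rough illustration of the difference, here is a minimal sketch that counts records in batch mode and in streaming mode; the local directories ./logs and ./logs_incoming are assumptions made only for this example.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "BatchVsStream")

# Batch: process a complete, offline dataset in a single pass
# ("./logs" is an assumed local directory of text files)
print(sc.textFile("./logs").count())

# Stream: process small batches of newly arriving files every 5 seconds
# ("./logs_incoming" is an assumed directory monitored for new files)
ssc = StreamingContext(sc, 5)
ssc.textFileStream("./logs_incoming").count().pprint()
ssc.start()
ssc.awaitTermination()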

2.2 Spark Streaming and Spark Structured Streaming

  • Before Spark 2.0, Spark Streaming was the main module for stream computing. Its data model is the DStream, which is essentially a queue of RDDs, each holding a small batch of data.

  • At present, the stream computing module that Spark mainly recommends is Structured Streaming. Its data model is the unbounded DataFrame, a data table without boundaries.

    • Compared with Spark Streaming, which is built on the RDD data structure, Structured Streaming is built on Spark SQL, so most of the DataFrame APIs can also be used in stream computing, unifying stream computing and batch processing. It also benefits from Spark SQL's optimizations, giving better performance and fault tolerance; a minimal word-count sketch in Structured Streaming is shown below for comparison.
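
The following is a minimal sketch of a network word count written with Structured Streaming, assuming (as in Section 3 below) a Netcat server listening on localhost:9999; it is meant only to illustrate the unbounded-DataFrame style of the API.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Unbounded DataFrame representing the lines read from the socket source
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# The same DataFrame API as in batch processing: split lines into words and count them
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()

# Print the complete word counts to the console after every micro-batch
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()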

3. Word Count Based on Spark Streaming

Scenario: count the words in the text data that a data server receives over a TCP socket.

(1) Create StreamingContext.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

(2) Create a DStream by specifying the hostname (here localhost) and port, which together determine the socket to connect to.

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

Each record in the DStream is a line of text; it now needs to be split into words on the spaces between them:

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

Then count each word in every batch:

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

The above only defines transformations; the computation does not actually start until the following is called:

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

The complete code is as follows:

r"""
 Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 Usage: network_wordcount.py <hostname> <port>
   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.

 To run this on your local machine, you need to first run a Netcat server
    `$ nc -lk 9999`
 and then run the example
    `$ bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999`
"""
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    # Create the StreamingContext with a batch interval of 1 second
    ssc = StreamingContext(sc, 1)
    
    # Create a DStream connected to the given hostname and port
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    
    # Action part: start the computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()

PS: Before running the above code, you need to start a Netcat data server. You can install Netcat (a powerful network tool) with yum install -y nc, start a listener with nc -lk 9999, and then verify it works by opening a new terminal and connecting with nc <ip> 9999.

4. Code Practice

  • Read the file https://cdn.coggle.club/Pokemon.csv as a textFileStream
  • Use filter to select the lines that do not contain Grass
  • Use flatMap to split the lines of text
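
A minimal sketch of this exercise is given below, assuming the CSV has been downloaded into a local directory (./pokemon_stream here) that textFileStream monitors; the directory name and the comma delimiter are assumptions for illustration, not part of the task statement.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "PokemonStream")
ssc = StreamingContext(sc, 5)

# textFileStream monitors a directory: files copied into it after the stream
# starts are read as new batches ("./pokemon_stream" is an assumed path into
# which Pokemon.csv should be copied once the job is running)
lines = ssc.textFileStream("./pokemon_stream")

# Keep only the lines that do not contain "Grass"
no_grass = lines.filter(lambda line: "Grass" not in line)

# Split each remaining line into its comma-separated fields
fields = no_grass.flatMap(lambda line: line.split(","))
fields.pprint()

ssc.start()
ssc.awaitTermination()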

