1.
- Why Spark Streaming was introduced
New scenario requirements
● Cluster monitoring
Generally, large-scale clusters and platforms need to be monitored:
monitoring various databases, such as MySQL, HBase, etc.;
monitoring applications, such as Tomcat, Nginx, Node.js, etc.;
monitoring hardware metrics, such as CPU, memory, disk, etc.;
and many more.
2.
- Introduction to Spark Streaming
● Official website
http://spark.apache.org/streaming/
● Overview
Spark Streaming is a real-time computing framework built on top of Spark Core. It can consume data from many data sources and process the data in real time, offering high throughput and strong fault tolerance.
3.
● Characteristics of Spark Streaming
1. Ease of use
You can write streaming programs in much the same way you write offline batch jobs, in Java, Scala, or Python.
2. Fault tolerance
Spark Streaming can recover lost work without additional code or configuration.
3. Integration with the Spark ecosystem
Stream processing can be combined with batch processing and interactive queries.
- Where real-time computing fits
4.
- Spark Streaming principle
- Overall process
In Spark Streaming there is a receiver component, the Receiver, which runs on an Executor as a long-running task. The Receiver receives the external data stream and turns it into an input DStream.
The DStream is divided into batches of RDDs according to a time interval. When the batch interval is shortened to the order of seconds, the system can process near-real-time data streams. The interval is specified by a parameter and is generally set between 500 milliseconds and a few seconds.
Operating on a DStream is operating on its RDDs, and the results of the computation can be pushed to external systems.
As the figures below illustrate, Spark Streaming receives real-time data, divides it into batches, and passes the batches to the Spark engine, which produces the final stream of batch results.
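The micro-batching step described above can be sketched in plain Python. This is a conceptual simulation, not the real Spark API: the function name `micro_batch` and the `(timestamp_ms, value)` record format are invented for illustration.

```python
from collections import defaultdict

def micro_batch(records, interval_ms):
    """Group (timestamp_ms, value) records into fixed-interval batches,
    mimicking how the Receiver's input stream is cut into per-interval RDDs.
    Intervals that receive no records are skipped here for brevity."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts // interval_ms].append(value)
    return [buckets[k] for k in sorted(buckets)]

# A 500 ms batch interval cuts this stream into three batches.
stream = [(100, "a"), (400, "b"), (900, "c"), (1200, "d")]
print(micro_batch(stream, 500))  # [['a', 'b'], ['c'], ['d']]
```

In real Spark Streaming the batching is driven by wall-clock time inside the engine; this sketch only shows the grouping logic.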
- Data abstraction
The basic abstraction of Spark Streaming is the DStream (Discretized Stream), which represents a continuous stream of input data, or the stream that results from applying Spark operators to it.
● DStream can be understood in depth from the following angles
1. A DStream is essentially a series of RDDs that are continuous in time
2. Operations on a DStream's data are also performed at the granularity of RDDs
3. Fault tolerance
The underlying RDDs have dependency relationships between them, so DStreams also have dependency relationships. Since RDDs are fault-tolerant, DStreams are fault-tolerant as well.
As shown in the figure: each ellipse represents one RDD.
Each circle inside an ellipse represents a partition of that RDD.
The multiple RDDs in each column make up one DStream (the figure has three columns, hence three DStreams).
The last RDD in each row is the intermediate-result RDD produced for each batch interval.
4. Quasi real-time / near real-time
Spark Streaming decomposes a streaming computation into a series of Spark jobs. The data for each time interval goes through Spark's DAG decomposition and task-set scheduling.
For the current version of Spark Streaming, the practical minimum batch interval is in the range of roughly 0.5 to 5 seconds.
Therefore, Spark Streaming suits quasi-real-time (near-real-time) streaming scenarios, but it is not suitable for scenarios with very strict latency requirements, such as high-frequency real-time trading.
● Summary
To put it simply, a DStream is an encapsulation of RDDs: when you operate on a DStream, you are operating on its RDDs.
DataFrame, Dataset, and DStream can all be understood, at bottom, as layers over RDDs.
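The "DStream wraps RDDs" idea can be made concrete with a toy Python class. This is a hypothetical stand-in (the name `ToyDStream` is invented, and plain lists stand in for RDDs), not Spark's actual DStream implementation.

```python
class ToyDStream:
    """Toy stand-in for a DStream: a sequence of batches, where each
    list stands in for one RDD. An operation on the stream is applied
    to every underlying batch."""
    def __init__(self, batches):
        self.batches = batches

    def map(self, f):
        # Mapping the stream = mapping each batch (RDD) individually.
        return ToyDStream([[f(x) for x in batch] for batch in self.batches])

ds = ToyDStream([[1, 2], [3], [4, 5]])
doubled = ds.map(lambda x: x * 2)
print(doubled.batches)  # [[2, 4], [6], [8, 10]]
```

The point of the sketch is only that a DStream-level operator forwards the work to each RDD in turn, which is exactly the summary above.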
5.
- DStream related operations
- Transformations
● Common Transformations --- stateless transformations: the processing of each batch does not depend on the data of previous batches
| Transformation | Meaning |
| --- | --- |
| map(func) | Applies func to each element of the DStream and returns a new DStream. |
| flatMap(func) | Similar to map, except that each input item can be mapped to zero or more output items. |
| filter(func) | Returns a new DStream containing only the elements for which func returns true. |
| union(otherStream) | Combines the source DStream with otherStream and returns a new DStream. |
| reduceByKey(func, [numTasks]) | On a DStream of (K, V) pairs, aggregates the values of each key with func and returns a new DStream of (K, V) pairs. |
| join(otherStream, [numTasks]) | On DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs. |
| transform(func) | Applies an arbitrary RDD-to-RDD function to each RDD of the DStream and returns a new DStream. |
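What "stateless" means in the table above can be illustrated with a small Python simulation of reduceByKey applied batch by batch. The helper name `reduce_by_key` and the list-of-pairs batch format are invented for illustration; this is not PySpark.

```python
def reduce_by_key(pairs, func):
    """Aggregate (key, value) pairs with func, like reduceByKey on one RDD."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

# Stateless semantics: every batch is reduced on its own, so the
# counts "reset" at each batch boundary instead of accumulating.
batches = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1)]]
per_batch = [reduce_by_key(b, lambda x, y: x + y) for b in batches]
print(per_batch)  # [{'a': 2, 'b': 1}, {'a': 1}]
```

Note that "a" is counted as 2 in the first batch and then starts over at 1 in the second; carrying the count across batches requires the stateful transformations described next in the document.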
● Special Transformations --- stateful transformations: processing the current batch requires the data or intermediate results of previous batches.
Stateful transformations include tracking state across batches (updateStateByKey) and sliding-window transformations.
1. updateStateByKey(func)
2. Window operations
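Both kinds of stateful transformation can be sketched in plain Python. The functions `update_state_by_key` and `windowed` below are conceptual simulations with invented names and simplified semantics (e.g. the window slides in whole batches), not Spark's actual operators.

```python
def update_state_by_key(batches, update):
    """updateStateByKey-style fold: per-key state is carried across
    batches and revised by each new batch's values. Returns the state
    snapshot emitted after every batch."""
    state, snapshots = {}, []
    for batch in batches:
        grouped = {}
        for k, v in batch:
            grouped.setdefault(k, []).append(v)
        for k, vs in grouped.items():
            state[k] = update(vs, state.get(k))
        snapshots.append(dict(state))
    return snapshots

def windowed(batches, window_len, slide):
    """Sliding-window view: each output combines the last window_len
    batches, advancing slide batches at a time."""
    return [sum(batches[max(0, i + 1 - window_len):i + 1], [])
            for i in range(0, len(batches), slide)]

# Running word count: "a" accumulates across batches instead of resetting.
counts = update_state_by_key(
    [[("a", 1), ("b", 1)], [("a", 1)]],
    lambda vs, prev: sum(vs) + (prev or 0))
print(counts)  # [{'a': 1, 'b': 1}, {'a': 2, 'b': 1}]

# A window of 2 batches sliding by 1 batch.
print(windowed([[1], [2], [3], [4]], 2, 1))  # [[1], [1, 2], [2, 3], [3, 4]]
```

Contrast this with the stateless case: here the previous state (or the previous batches inside the window) feeds into the current batch's result.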
- Output/Action
Output operations can push a DStream's data to an external database or file system.
When an output operation is invoked, the Spark Streaming program triggers the actual computation (analogous to an action on an RDD).
| Output Operation | Meaning |
| --- | --- |
| print() | Prints the contents of each batch to the console. |
| saveAsTextFiles(prefix, [suffix]) | Saves the stream's contents as text files; each batch's file name is "prefix-TIME_IN_MS[.suffix]". |
| saveAsObjectFiles(prefix, [suffix]) | Saves the stream's contents as SequenceFiles of serialized objects; each batch's file name is "prefix-TIME_IN_MS[.suffix]". |
| saveAsHadoopFiles(prefix, [suffix]) | Saves the stream's contents as Hadoop files; each batch's file name is "prefix-TIME_IN_MS[.suffix]". |
| foreachRDD(func) | Applies func to each RDD generated from the stream. |
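The foreachRDD pattern can be sketched in plain Python as well. The name `foreach_rdd` and the in-memory "sink" are invented for illustration; in real Spark Streaming, func would typically write each RDD's data to a database or file system.

```python
def foreach_rdd(batches, func):
    """foreachRDD-style output: run func against every batch (RDD).
    This is where side effects such as database writes happen, and in
    Spark it is an output operation that triggers the computation."""
    for batch in batches:
        func(batch)

# Usage: push each batch's sum into an in-memory "sink" list.
sink = []
foreach_rdd([[1, 2], [3]], lambda batch: sink.append(sum(batch)))
print(sink)  # [3, 3]
```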
- Summary