Introduction to SparkStreaming

One 

  • Why Spark Streaming was introduced

Requirements of new scenarios

● Cluster monitoring

Generally, large-scale clusters and platforms need to be monitored.

Monitoring various databases, such as MySQL, HBase, etc.

Monitoring applications, such as Tomcat, Nginx, Node.js, etc.

Monitoring hardware metrics, such as CPU, memory, disk, etc.

And many more.

 

Two

 

  • Introduction to Spark Streaming

● Official website

http://spark.apache.org/streaming/

● Overview

Spark Streaming is a real-time computing framework based on Spark Core. It can consume data from many data sources and process the data in real time. It features high throughput and strong fault tolerance.

Three

● Characteristics of Spark Streaming

1. Easy to use

You can write streaming programs just as you would write offline batch jobs, with support for Java / Scala / Python.

2. Fault tolerance

Spark Streaming can recover lost work without additional code or configuration.

3. Easy to integrate into Spark system

Stream processing can be combined with batch processing and interactive queries.


 

Four

  • Spark Streaming principle
  1.  Overall process

In Spark Streaming there is a receiver component, the Receiver, which runs on an Executor as a long-running task. The Receiver receives the external data stream and forms the input DStream.

 

The DStream is divided into batches of RDDs according to a time interval. When the batch interval is shortened to the order of seconds, it can be used to process real-time data streams. The size of the time interval is specified by a parameter, generally set between 500 milliseconds and a few seconds.

Operating on a DStream means operating on RDDs, and the results of the computation can be sent to external systems.

The Spark Streaming workflow is as shown in the figure below: real-time data is received and split into batches, which are then passed to the Spark engine for processing, producing the final batches of results.
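A minimal Scala sketch of this workflow is shown below (the application name, host, and port are illustrative placeholders): a StreamingContext is created with a 1-second batch interval, a Receiver ingests lines from a socket, and each resulting batch is printed.

```scala
// A minimal sketch, assuming a local run; app name, host, and port are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")

    // The batch interval is the time-interval parameter described above (1 second here).
    val ssc = new StreamingContext(conf, Seconds(1))

    // The Receiver runs on an Executor as a long-running task and produces the input DStream.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations on `lines`, like the ones shown later in this article, go here,
    // before start().
    lines.print()          // an output operation, which triggers the actual computation

    ssc.start()            // start receiving and processing the stream
    ssc.awaitTermination() // block until the job is stopped
  }
}
```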

  2. Data abstraction

The basic abstraction in Spark Streaming is the DStream (Discretized Stream, a discretized stream of data), which represents both the continuous input data stream and the result data stream obtained after applying various Spark operators.

A DStream can be understood more deeply from the following angles:

1. A DStream is essentially a series of RDDs that are continuous in time

2. Operations on DStream data are also performed in units of RDDs

3. Fault tolerance

The underlying RDDs have dependency relationships between them, and DStreams likewise have dependency relationships between them; since RDDs are fault-tolerant, DStreams are fault-tolerant as well.

As shown in the figure: each ellipse represents an RDD

Each circle inside an ellipse represents a partition of that RDD

The multiple RDDs in each column make up one DStream (there are three columns in the figure, so there are three DStreams)

The last RDD in each row represents the intermediate result RDD produced for each batch interval

4. Quasi real-time / near real-time

Spark Streaming decomposes the streaming computation into multiple Spark jobs. The data of each time period goes through Spark DAG decomposition and Spark task-set scheduling when it is processed.

For the current version of Spark Streaming, the minimum batch size is between 0.5 and 5 seconds.

Therefore, Spark Streaming can satisfy quasi-real-time streaming computation scenarios, but it is not suitable for high-frequency trading scenarios with extremely strict real-time requirements.

● Summary

To put it simply, a DStream is an encapsulation of RDDs; when you operate on a DStream, you are operating on RDDs.

DataFrame / DataSet / DStream can all essentially be understood as RDDs.
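To make this concrete, here is a small sketch (reusing the `lines` DStream from the earlier snippet) in which `transform` exposes each batch as an ordinary RDD, so any RDD operation can be applied to it:

```scala
// Operating on a DStream is operating on its underlying RDDs:
// `transform` hands each batch to the function as a plain RDD.
val upper = lines.transform { rdd =>
  rdd.map(_.toUpperCase)   // an ordinary RDD operation, applied batch by batch
}
```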

Five

  •  DStream related operations

      1.  Transformations

● Common Transformations --- stateless transformations: the processing of each batch does not depend on data from previous batches

map(func): Applies the function func to each element of the DStream and returns a new DStream.

flatMap(func): Similar to map, except that each input item can be mapped to zero or more output items.

filter(func): Keeps only the elements of the DStream for which func returns true and returns a new DStream.

union(otherStream): Combines the source DStream with otherStream and returns a new DStream.

reduceByKey(func, [numTasks]): Aggregates the values of each key in the source DStream using func and returns a new DStream of (K, V) pairs.

join(otherStream, [numTasks]): Given DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.

transform(func): Applies an RDD-to-RDD function to each RDD of the DStream; any RDD operation can be used, and a new DStream is returned.
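As a small sketch of these stateless transformations (building on the `lines` DStream from the earlier snippet), a per-batch word count can be written as follows; each 1-second batch is counted independently of the previous ones:

```scala
// Stateless processing: every batch is handled on its own.
val words      = lines.flatMap(_.split(" "))      // flatMap: one line -> many words
val pairs      = words.map(word => (word, 1))     // map: word -> (word, 1)
val wordCounts = pairs.reduceByKey(_ + _)         // reduceByKey: aggregate within the batch
wordCounts.print()
```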

 

● Special Transformations --- stateful transformations: the processing of the current batch needs to use data or intermediate results from previous batches.

Stateful transformations include transformations based on tracking state changes (updateStateByKey) and sliding-window transformations; a brief sketch of both follows the list below.

1. updateStateByKey(func)

2.  Window Operations
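
A minimal sketch of both stateful operations, assuming the `ssc` StreamingContext and the `pairs` DStream from the earlier snippets (the checkpoint directory is a placeholder):

```scala
// Stateful processing needs a checkpoint directory so the state can be stored reliably.
ssc.checkpoint("./checkpoint")

// 1. updateStateByKey: maintain a running count of each word across all batches.
val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}

// 2. Window operation: count words over the last 30 seconds, sliding every 10 seconds.
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```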

 

      2.  Output/Action

Output Operations can output DStream data to an external database or file system

When an Output Operation is called, the Spark Streaming program starts the actual computation (similar to an RDD action).

print(): Prints the results to the console.

saveAsTextFiles(prefix, [suffix]): Saves the contents of the stream as text files; the file name for each batch interval is "prefix-TIME_IN_MS[.suffix]".

saveAsObjectFiles(prefix, [suffix]): Saves the contents of the stream as SequenceFiles of serialized objects; the file name for each batch interval is "prefix-TIME_IN_MS[.suffix]".

saveAsHadoopFiles(prefix, [suffix]): Saves the contents of the stream as Hadoop files; the file name for each batch interval is "prefix-TIME_IN_MS[.suffix]".

foreachRDD(func): Runs func on each RDD in the DStream.
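
As a sketch of these output operations (continuing from the `wordCounts` DStream above; the output path is a placeholder):

```scala
// Save each batch of results as text files named "prefix-TIME_IN_MS.txt".
wordCounts.saveAsTextFiles("hdfs:///streaming/wordcounts", "txt")

// foreachRDD gives full control over where the data goes, e.g. an external database.
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // In a real job, open one connection per partition here, write the records, then close it.
    partition.foreach { case (word, count) => println(s"$word -> $count") }
  }
}
```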
