[Big Data] [Spark] Spark Streaming basics

I. Overview

The basic data processing unit in Spark Streaming is the DStream. Spark Streaming mainly processes stream data that is continuously pushed into the program, and it can be combined with Spark Core and Spark SQL to process that data: if the source data is unstructured, it can be handled together with Spark Core; if the data is structured, it can be processed together with Spark SQL.

Features

  • Streaming programs can be developed almost as easily as offline batch jobs, using Java, Scala, or Python (see the sketch after this list)
  • Fault tolerance: Spark Streaming can recover lost work without extra code or configuration
  • It integrates easily with the rest of the Spark ecosystem
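
Below is a minimal sketch of the first point: a socket word count written in the same style as a batch job. The host, port, and application name are placeholders chosen for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads: one for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

    // Read lines from a socket source (hostname and port are placeholders)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until stopped
  }
}
```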

Differences from Storm

  • Storm processes every incoming event individually, while Spark Streaming processes the events that arrive within a time window
  • Storm's latency is generally lower than Spark Streaming's, because Spark Streaming works in micro-batches: data is collected over the batch interval and each batch triggers one computation. For example, if the interval is set to 5 seconds, the data received in those 5 seconds triggers one computation. Storm processes each record as it arrives and triggers a computation per record. In this sense Spark Streaming does micro-batch stream computation while Storm does true real-time computation; Storm's Trident (and Alibaba's JStorm) also supports micro-batch computation
  • Throughput: Storm's throughput is somewhat lower than Spark Streaming's. In Storm, source data is received by a spout and sent to bolts; a bolt processes the data and then writes it to an external storage system or forwards it to the next bolt for further processing, so Storm moves data rather than moving computation. Spark Streaming determines which node holds the data a task needs, and the TaskScheduler sends the task to that node for processing, so Spark Streaming moves computation rather than data, which is the mainstream design in current compute engines. For these two reasons it is easy to see why a batch system generally achieves higher throughput than a per-record, real-time system
  • Storm relies on the acker (ack/fail message acknowledgment) mechanism to guarantee that each tuple is fully processed. Spark Streaming's fault tolerance keeps the RDD transformation lineage in memory: if dataset B is computed from dataset A and an error occurs, B can be recomputed from A because the computation logic from A to B is stored. The fault-tolerance models are simply different; neither is clearly better or worse
  • One thing Storm absolutely cannot match: Spark provides a unified solution in which offline computation, stream computation, graph computation, and machine learning all run inside one cluster, whereas Storm is only a cluster for real-time computation

II. Basic working principle

Spark Streaming receives a real-time input data stream and splits it into batches: for example, the data collected every second is packaged into one batch, each batch is handed to the Spark compute engine for processing, and the final result is itself a data stream made up of the per-batch results.
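
The fragment below sketches how each collected batch surfaces as one RDD handed to the Spark engine; `lines` is assumed to be the socket DStream from the earlier sketch.

```scala
// Each invocation of the function corresponds to one batch interval;
// `rdd` holds the data collected during that interval.
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}
```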

III. DStream

DStream, whose full name is Discretized Stream, represents a continuous data stream. A DStream can be created from an input data source such as Kafka, Flume, or Kinesis, or derived from another DStream by applying higher-order functions such as map, reduce, join, and window.
Applying an operator such as map to a DStream is, under the hood, translated into an operation on each of the DStream's underlying RDDs. For example, applying map to a DStream produces a new DStream: at the bottom, the map operation is applied to the RDD of each time interval of the input DStream, and each resulting RDD becomes the RDD of the corresponding time interval in the new DStream. The underlying RDD transformations are still executed by the Spark Core compute engine; Spark Streaming wraps Spark Core, hides these details, and provides developers with an easy-to-use high-level API.
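
A short sketch of this per-batch translation, assuming `lines` is an existing DStream[String]:

```scala
// Each DStream operator is applied to every underlying batch RDD.
val lengths = lines.map(_.length)      // new DStream: map applied to each batch RDD
val longOnes = lengths.filter(_ > 80)  // chained transformation, still per-RDD underneath
longOnes.print()
```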

IV. Kafka Receiver-based and Direct approaches

  • Receiver-based approach
    When data is obtained through a Receiver, enabling the high-reliability, zero-data-loss mechanism requires turning on Spark Streaming's write-ahead log (WAL), but the write-ahead log is inefficient. This approach also cannot guarantee that data is processed exactly once.
  • Direct approach
    The newer Direct approach, introduced in Spark 1.3, is not based on a Receiver and provides a more robust mechanism. There is no need to enable the WAL: as long as Kafka keeps replicas of the data, lost data can be recovered from those replicas. It can guarantee that data is consumed exactly once (a sketch follows this list).
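
A minimal sketch of the Direct approach using the spark-streaming-kafka-0-10 integration; the broker address, topic, and group id are placeholders, and `ssc` is assumed to be an existing StreamingContext.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",          // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                  // placeholder consumer group
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Direct stream: executors read the chosen offset ranges straight from Kafka,
// so recovery relies on Kafka's own replicas rather than a write-ahead log.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("topicA"), kafkaParams)
)

stream.map(record => (record.key, record.value)).print()
```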

V. Caching and persistence

Like RDDs, Spark Streaming lets developers explicitly persist stream data in memory. Calling the persist() method on a DStream makes Spark Streaming automatically persist in memory every RDD generated for that data stream. This is very useful when multiple operations are performed on the same DStream, because those operations can share the data cached in memory.
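
A sketch of caching a DStream that several outputs reuse; `lines` is assumed to be an existing DStream[String], such as the socket stream above.

```scala
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
pairs.persist() // every RDD generated for this DStream is kept in memory

// Both outputs below reuse the cached batch RDDs instead of recomputing `pairs`
pairs.reduceByKey(_ + _).print()
pairs.count().print()
```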

For window-based operations such as reduceByWindow and reduceByKeyAndWindow, and for state-based operations such as updateStateByKey, the default persistence mechanism is enabled implicitly: Spark Streaming automatically caches in memory the DStream data produced by these operations, so developers do not need to call persist() themselves.

For input streams received over the network, such as socket, Kafka, and Flume sources, the default persistence level replicates the data to facilitate fault tolerance; it is equivalent to MEMORY_ONLY_SER_2.

Unlike RDDs, the default persistence level for DStreams always serializes the data.
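
A short sketch of choosing these levels explicitly; the host and port are placeholders and `ssc` is an assumed StreamingContext.

```scala
import org.apache.spark.storage.StorageLevel

// A receiver-based socket stream defaults to a replicated, serialized level
// (MEMORY_ONLY_SER_2); a different level can be passed explicitly if desired.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

// A derived DStream can likewise be persisted with an explicit level instead of the default
lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)
```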

VI. Checkpoint

Checkpointing an RDD to a reliable storage system costs a lot of performance: when an RDD is checkpointed, the processing time of its batch increases. The checkpoint interval therefore needs careful tuning. For jobs with a small batch interval, such as one second, checkpointing every batch would significantly reduce throughput. On the other hand, checkpointing too infrequently lets the RDD lineage grow long, and a failure would then require a long recovery time.
For transformations that require state checkpointing, the default checkpoint interval is a multiple of the batch interval, and at least 10 seconds. The checkpoint interval of a DStream can be set with its checkpoint() method. Typically, setting the checkpoint interval to 5 to 10 times the sliding interval of the window operation is a good choice.
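
A sketch of this setup; the HDFS path and the 10-second interval are placeholders, and `ssc` and `pairs` (a DStream[(String, Int)]) are assumed from the earlier sketches.

```scala
import org.apache.spark.streaming.Seconds

// Checkpoint data goes to reliable storage (path is a placeholder)
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")

// Stateful operations such as updateStateByKey require the checkpoint directory;
// the checkpoint interval of the resulting DStream can be tuned explicitly.
val updateRunningCount = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateRunningCount)
runningCounts.checkpoint(Seconds(10)) // e.g. 5 to 10 times the batch interval
runningCounts.print()
```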


Source: blog.csdn.net/cheidou123/article/details/94167432