A brief understanding of Spark's KafkaUtils.createDirectStream

Reference articles:

https://www.cnblogs.com/runnerjack/p/8597981.html

https://blog.csdn.net/qq_41083134/article/details/99561175

 

1. The Receiver-based approach

This approach uses a Receiver to pull the data. The Receiver is implemented with Kafka's high-level consumer API. The data the Receiver fetches from Kafka is stored in the memory of the Spark executors (if the data volume surges and large batches pile up, out-of-memory errors become likely), and Spark Streaming then launches jobs to process that data.
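As a rough sketch of what the receiver-based setup looks like (assuming the spark-streaming-kafka 0.8 integration; the ZooKeeper quorum, group id, and topic name are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverBasedDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// createStream uses Kafka's high-level consumer API; offsets are tracked in
// ZooKeeper and the received records are held by receivers on the executors.
val lines = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",              // ZooKeeper quorum (placeholder)
  "demo-consumer-group",            // consumer group id (placeholder)
  Map("demo-topic" -> 2),           // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER  // spill to disk rather than overflow memory
).map(_._2)                         // records arrive as (key, message) pairs

lines.count().print()
ssc.start()
ssc.awaitTermination()
```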

However, under the default configuration, this approach can lose data when the underlying infrastructure fails. To get a highly reliable, zero-data-loss guarantee, you must enable Spark Streaming's write-ahead log (Write Ahead Log, WAL). With the WAL on, the data received from Kafka is synchronously written to a write-ahead log on a distributed file system such as HDFS, so even if a node fails, the data can be recovered from the log.

Points to note: the WAL is enabled by setting spark.streaming.receiver.writeAheadLog.enable to true, and it requires a checkpoint directory on a fault-tolerant file system such as HDFS; once the WAL is on, the received data no longer needs in-memory replication, so a serialized, non-replicated storage level is usually appropriate.
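A minimal configuration sketch for turning the WAL on (same 0.8-era integration assumed; the application name and HDFS path are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  // Persist every received block to the write-ahead log before acknowledging it.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// The WAL is stored under the checkpoint directory, so it must sit on a
// fault-tolerant file system such as HDFS.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
```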

2. The Direct approach

This newer approach is direct and receiver-free; it was introduced in Spark 1.3 to provide a more robust mechanism. Instead of using a Receiver to receive the data, it periodically queries Kafka for the latest offset of each topic+partition, and from those offsets defines an offset range for each batch. When the processing jobs start, Kafka's simple consumer API is used to fetch exactly the data in the specified offset ranges.
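A minimal direct-stream sketch under the same assumptions (spark-streaming-kafka 0.8 API, Spark 1.3+; broker list and topic name are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// No receiver: each batch, the driver asks Kafka for the latest offset of
// every topic+partition and turns the gap into one offset range per partition.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092,broker2:9092" // placeholder brokers
)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("demo-topic"))

stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
```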

Its advantages are as follows:

Simplified parallel reading: to read multiple partitions with the receiver-based approach, you have to create multiple input DStreams and union them. The direct approach instead creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.

High performance: to guarantee zero data loss with the receiver-based approach, the WAL mechanism must be enabled. That is actually inefficient, because the data gets copied twice: Kafka itself already has a highly reliable replication mechanism, and the WAL copies the data a second time. The direct approach does not depend on a Receiver and does not need a WAL; as long as Kafka retains a copy of the data, it can be recovered from Kafka itself.

An exactly-once transaction mechanism: offsets are tracked by Spark Streaming itself rather than by ZooKeeper, so each record can be processed once and only once (see the comparison below, and the offset-range sketch that follows).
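Continuing the direct-stream sketch above (stream is the DStream created there), the exact offset range consumed in each batch can be read off the underlying RDD via HasOffsetRanges; committing these ranges atomically with the results is the hook for exactly-once pipelines:

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The cast must happen on the Kafka RDD itself, before any shuffle.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}")
  }
  // Storing these ranges in the same transaction as the processed results is
  // what turns at-least-once delivery into exactly-once processing.
}
```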

3. Comparison

The receiver-based approach uses Kafka's high-level API to save the consumed offsets in ZooKeeper. This is the traditional way of consuming Kafka data. Combined with the WAL mechanism it can guarantee zero data loss and high reliability, but it cannot guarantee that each record is processed once and only once; a record may be processed twice, because Spark and ZooKeeper can fall out of sync.

The direct approach uses Kafka's simple API: Spark Streaming itself is responsible for tracking the consumed offsets and stores them in its checkpoints. Spark therefore always stays in sync with the offsets it has actually processed, which guarantees that each record is consumed once and only once.
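A sketch of that recovery path, assuming a fault-tolerant checkpoint directory (the HDFS path, broker, and topic below are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs://namenode:8020/spark/direct-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectWithCheckpoint")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // The stream and all processing must be defined inside this function so
  // the full lineage, including the tracked offsets, lands in the checkpoint.
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("demo-topic"))
  stream.map(_._2).count().print()
  ssc
}

// First run: builds a fresh context. After a driver failure: rebuilds the
// context from the checkpoint and resumes from the offsets saved there.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```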

In real production environments, the Direct approach is what is used most of the time.

