Comparing the two ways Spark Streaming reads from Kafka: the Receiver-based approach and the Direct approach

The Receiver-based approach uses Kafka's high-level Consumer API. The data the Receiver pulls from Kafka is stored in the memory of the Spark Executors, and Spark Streaming then launches jobs to process it. Under the default configuration, however, this approach can lose data when the underlying node fails. To get a highly reliable, zero-data-loss guarantee, you must enable Spark Streaming's Write Ahead Log (WAL), which synchronously writes the data received from Kafka to a write-ahead log on a distributed file system such as HDFS. Even if a node fails, the data can then be recovered from the write-ahead log, but throughput drops.
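A minimal sketch of the Receiver-based approach with the WAL enabled, assuming the spark-streaming-kafka-0-8 artifact; the ZooKeeper address, consumer group, topic name and checkpoint path are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverWalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverWalExample")
      // Turn on the write-ahead log so received data is persisted
      // under the checkpoint directory before the job processes it.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL is written under the checkpoint directory (placeholder path).
    ssc.checkpoint("hdfs:///tmp/receiver-wal-checkpoint")

    // Receiver-based stream using Kafka's high-level consumer API.
    val lines = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",                  // ZooKeeper quorum (placeholder)
      "my-consumer-group",             // consumer group id (placeholder)
      Map("my-topic" -> 1),            // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK_SER // no extra in-memory replica; the WAL already persists data
    ).map(_._2)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```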

The Direct approach periodically queries Kafka for the latest offset of each topic + partition and uses those offsets to define the offset range of each batch. When the jobs that process the data are launched, Kafka's simple consumer API is used to read the specified offset ranges from Kafka.
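A minimal sketch of the Direct approach, again assuming the spark-streaming-kafka-0-8 artifact; broker list and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // No Receiver: each batch reads a well-defined offset range per partition
    // directly from the Kafka brokers using the simple consumer API.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("my-topic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```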
The Direct approach has the following advantages:
1. Simplified parallel reads: with the Receiver-based approach, reading multiple partitions in parallel means creating multiple input DStreams and then unioning them. With the Direct approach, Spark creates as many RDD partitions as there are Kafka partitions and reads them all from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions (see the first sketch after this list).


2. High performance: to guarantee zero data loss with the Receiver-based approach you have to enable the WAL, which is inefficient because the data is effectively copied twice: Kafka already replicates it for reliability, and it is copied again into the WAL. The Direct approach does not rely on a Receiver and needs no WAL; as long as Kafka retains the data, it can be recovered from Kafka's own copies.

3. Exactly-once semantics:
The Receiver-based approach uses Kafka's high-level API to store the consumed offsets in ZooKeeper, which is the traditional way of consuming Kafka data. Combined with the WAL it achieves zero data loss and high reliability, but it cannot guarantee that the data is processed exactly once; it may be processed twice, because the offsets tracked by Spark and by ZooKeeper can get out of sync.
The Direct approach uses Kafka's simple API, and Spark Streaming itself tracks the consumed offsets and stores them in its checkpoint. Since the offsets stay in sync with Spark, the data is guaranteed to be consumed once and only once (see the second sketch below).
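The first sketch contrasts the parallel-read patterns from advantage 1: several receiver streams unioned together versus a single direct stream whose RDDs map one-to-one onto Kafka partitions. It assumes the spark-streaming-kafka-0-8 artifact; broker, ZooKeeper, group and topic names are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ParallelReadExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ParallelReadExample"), Seconds(5))

    // Receiver-based: one DStream per receiver, unioned so partitions are read in parallel.
    val receiverStreams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    }
    val unioned = ssc.union(receiverStreams).map(_._2)

    // Direct: a single DStream; each batch RDD has one partition per Kafka partition.
    val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("my-topic")).map(_._2)

    unioned.count().print()
    direct.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```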
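The second sketch illustrates advantage 3: with the Direct approach the consumed offsets live in the checkpoint, so StreamingContext.getOrCreate restores them on restart, and each batch RDD exposes its offset range through HasOffsetRanges. Broker, topic and checkpoint values are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectOffsetsExample {
  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the context from the checkpoint, including
    // the last consumed offsets, so each offset range is processed once and only once.
    val checkpointDir = "hdfs:///tmp/direct-checkpoint" // placeholder path
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
    ssc.start()
    ssc.awaitTermination()
  }

  def createContext(checkpointDir: String): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("DirectOffsetsExample"), Seconds(5))
    ssc.checkpoint(checkpointDir)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker1:9092"),
      Set("my-topic"))

    stream.foreachRDD { rdd =>
      // The offset range of each batch is carried by the RDD itself;
      // it can be logged or saved to an external store for auditing.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }
    ssc
  }
}
```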

Origin www.cnblogs.com/Mr--zhao/p/11275696.html