Spark Streaming common big data interview questions

1. In what ways can Spark Streaming consume data from Kafka, and what are the differences between them?

1. Receiver-based approach

  • This method uses a Receiver to obtain data. The Receiver is implemented with Kafka's high-level Consumer API. The data the Receiver pulls from Kafka is stored in the memory of the Spark executors (if there is a sudden spike in data and a large number of batches accumulate, it is easy to run into out-of-memory problems), and the jobs started by Spark Streaming then process that data.
  • However, with the default configuration this method may lose data when the underlying node fails. To get high availability with zero data loss, you must enable Spark Streaming's Write Ahead Log (WAL) mechanism, which synchronously writes the received Kafka data to a write-ahead log on a distributed file system (such as HDFS). Even if a node fails, the data can then be recovered from the write-ahead log.
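A minimal sketch of the receiver-based approach, assuming the legacy spark-streaming-kafka-0-8 integration; the ZooKeeper address, consumer group, topic name, and checkpoint path are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverBasedKafka")
      // Enable the WAL so data received by the Receiver survives node failures
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL is written under the checkpoint directory on a reliable file system
    ssc.checkpoint("hdfs:///tmp/receiver-checkpoint")

    // topic -> number of receiver threads (placeholder values)
    val topics = Map("my-topic" -> 1)
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", topics)

    // stream elements are (key, message) pairs; count the messages per batch
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```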

2. Direct-based approach

  • This direct approach, which does not use a Receiver, was introduced in Spark 1.3 to provide a more robust mechanism. Instead of receiving data through a Receiver, it periodically queries Kafka for the latest offset of each topic+partition, which defines the offset range of each batch. When the job that processes the data starts, it uses Kafka's simple consumer API to fetch the data in the specified offset range from Kafka.
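A minimal sketch of the direct approach, again assuming the spark-streaming-kafka-0-8 integration; the broker list and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafka")
    val ssc = new StreamingContext(conf, Seconds(5))

    // No Receiver and no WAL: Spark Streaming itself tracks the offsets per batch
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("my-topic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```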

The advantages of the direct approach are as follows:

  • Simplified parallel reading: to read multiple partitions you no longer need to create multiple input DStreams and union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.
  • High performance: with the receiver-based method you have to turn on the WAL to guarantee zero data loss, which is inefficient because the data is effectively copied twice: Kafka already keeps highly reliable replicas of the data, and it is copied again into the WAL. The direct approach does not rely on a Receiver and does not need the WAL; as long as the data is replicated in Kafka, it can be recovered from Kafka's replicas.
  • Exactly-once ("once and only once") transaction semantics (a sketch of reading per-partition offset ranges for transactional output follows this list).
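A sketch of how a direct stream exposes per-partition offset ranges, which can be committed in the same transaction as the results to achieve exactly-once output (the `stream` variable is the direct stream created above):

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // Each RDD produced by a direct stream carries the Kafka offset range it covers
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Process the data and store the results together with offsetRanges in one
  // transaction (e.g. the same database commit) for exactly-once output.
  offsetRanges.foreach { o =>
    println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
}
```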

3. Comparison of the two

  • The receiver-based method uses Kafka's high-level API and saves the consumed offsets in ZooKeeper. This is the traditional way of consuming Kafka data. Combined with the WAL mechanism, it can guarantee high reliability with zero data loss, but it cannot guarantee that the data is processed exactly once; records may be processed twice, because Spark and ZooKeeper can get out of sync.
  • The direct method uses Kafka's simple API, and Spark Streaming itself is responsible for tracking the consumed offsets and saving them in checkpoints. Since Spark keeps its own state consistent, it can guarantee that the data is consumed exactly once.
  • In real production environments, the Direct approach is used in most cases.

2. Principle of the Spark Streaming window function

  • The window function wraps another layer on top of the batch size originally defined for the Spark Streaming computation. Each window computation covers multiple batches of data, and it also takes a sliding-interval parameter that determines where the next computation starts after the current one finishes.
  • In the figure, time1 is the batch interval of the Spark Streaming computation. The dashed box and the large solid box are the window size, which must be an integer multiple of the batch interval. The distance between the dashed box and the large solid box (how many batches apart) is the sliding interval.
    (Figure: relationship between batch interval, window size, and sliding interval)
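A minimal sketch of a windowed computation, assuming a socket source on localhost:9999: a 30-second window sliding every 10 seconds over a 5-second batch interval (both the window length and the sliding interval are integer multiples of the batch interval):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowExample")
val ssc = new StreamingContext(conf, Seconds(5))   // batch interval: 5 seconds

val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  // window length 30s, sliding interval 10s: every 10s, count the last 30s of data
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

wordCounts.print()
ssc.start()
ssc.awaitTermination()
```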

3. Spark Streaming fault tolerance principle

One key feature of Spark Streaming is its high fault tolerance.

  • First of all, Spark RDDs themselves have a fault-tolerance mechanism. Each RDD is an immutable, distributed, recomputable data set that records the lineage of deterministic operations that produced it, so as long as the input data is fault-tolerant, any lost or unavailable RDD partition can be recomputed from the original input data through the same transformations.
  • Write-ahead logs are commonly used in databases and file systems to guarantee the durability of data operations: the operation is first written to a durable and reliable log file, and only then applied to the data. If an error occurs while applying the operation, the log file can be read and the operation reapplied.
  • In addition, the Receiver acknowledges received data only after it has been written to the write-ahead log. Data that has been buffered but not yet saved can be sent again by the data source after the driver restarts. These two mechanisms together ensure zero data loss: all data is either recovered from the log or resent by the data source.
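A minimal sketch of driver-side fault tolerance through checkpointing, assuming a hypothetical checkpoint path on HDFS: on restart the StreamingContext is rebuilt from the checkpoint instead of being created from scratch, and the Receiver WAL is enabled as described above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("FaultTolerantApp")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // define the input DStreams and transformations here, before returning ssc
  ssc
}

// Recover from the checkpoint if it exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```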
