An illustrated guide to integrating SparkStreaming with Kafka: details everyone should pay attention to!

Preface


Lao Liu is a second-year graduate student who is about to start looking for a job. He writes this blog partly to review and summarize the knowledge points of big data development, and partly to help friends who are teaching themselves. Since Lao Liu is self-taught in big data development, there are bound to be some shortcomings, so everyone is welcome to point them out and correct them, and we can make progress together!

Today's topic is the integration of SparkStreaming and Kafka. This article is well suited to those who are just getting started, and everyone is welcome to share their opinions. Lao Liu will use pictures to describe details that other technical blogs leave out; these details are very useful for friends who are just getting started!

Main text

Why do we need to integrate SparkStreaming with Kafka?

First of all, we need to understand why SparkStreaming and Kafka are integrated at all; nothing happens without a reason!

Spark is a real-time computing framework: it only handles computation, not data storage, so we need to connect Spark to external data sources. SparkStreaming, as a sub-module of Spark, supports 4 kinds of data sources:

  1. Socket data source (used for testing; a minimal sketch follows this list)
  2. HDFS data source (used occasionally, but not much)
  3. Custom data source (not important; you rarely see anyone write a custom source)
  4. Extended data sources (such as the Kafka data source; this one is very important and comes up in interviews)
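
To make option 1 above concrete, here is a minimal sketch of a Socket data source used for local testing. The host, port, and word-count logic are purely illustrative and are not from Lao Liu's article:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batch interval

    // Read lines from a socket, e.g. one opened locally with `nc -lk 9999`
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```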

Below, Lao Liu illustrates the integration of SparkStreaming and Kafka, but only the principles; he will not post the full code, since there is already plenty of it online. Lao Liu writes about the things he has actually understood!
[Figure: overview of the SparkStreaming-Kafka integration]

SparkStreaming integration with Kafka-0.8

How SparkStreaming integrates with Kafka depends on the Kafka version. The first thing to talk about is integration with Kafka-0.8.

When SparkStreaming is integrated with kafka-0.8, the simplest way to ensure data is not lost is to rely on the checkpoint mechanism, but checkpoints have a problem: after a code upgrade, the checkpoint becomes unusable. So if you want to prevent data loss, you need to manage the offsets yourself.
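
For reference, checkpoint-based recovery usually goes through StreamingContext.getOrCreate. Here is a minimal sketch; the checkpoint path and the createContext helper are illustrative, not from the article:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointDemo {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // illustrative path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointDemo")
    val ssc  = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    // ... build the Kafka DStream and the processing logic here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Restore from the checkpoint if it exists; otherwise build a fresh context.
    // If the application code changes, the old checkpoint can no longer be restored,
    // which is exactly the problem discussed in this article.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```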

Code upgrades should be familiar to everyone, but Lao Liu will explain them properly!

In daily development we often run into two situations: the code has a bug, so we fix it, repackage, and resubmit; or the business logic changes, so we have to modify the code again!

When the checkpoint is persisted for the first time, the whole related jar is serialized into a binary file stored under a directory named by a unique value. When SparkStreaming later tries to recover through the checkpoint, if the code has changed even a little bit, the previously serialized state under that directory can no longer be restored, and the result is data loss!

So we need to manage the offset ourselves!

[Figure: SparkStreaming + Kafka-0.8 with offsets managed in ZooKeeper]
Here we use a ZooKeeper cluster to manage the offsets. After the program starts, it reads the last saved offset, and SparkStreaming then reads data from Kafka starting at that offset. Once the batch has been processed, the new offset is committed to the ZooKeeper cluster. There is one small problem, though: if the program crashes after the results have already been partially written to HBase but before the offset was committed, then on restart the same data will be read again and duplicates appear. However, this only affects one batch, which is a very small amount in a big data context!
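
A compact sketch of the flow in the figure, assuming hypothetical zkReadOffsets/zkSaveOffsets stand-ins in place of real ZooKeeper reads and writes (e.g. via Curator); the topic, group, and broker names are illustrative:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object Kafka08ZkOffsets {
  // Stand-in: load the last committed offset per partition from ZooKeeper.
  def zkReadOffsets(topic: String, group: String): Map[TopicAndPartition, Long] = Map.empty

  // Stand-in: write the offsets of the batch that was just processed back to ZooKeeper.
  def zkSaveOffsets(group: String, ranges: Array[OffsetRange]): Unit =
    ranges.foreach(r => println(s"commit ${r.topic}-${r.partition} -> ${r.untilOffset}"))

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("Kafka08ZkOffsets"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092", "group.id" -> "demo-group")
    val fromOffsets = zkReadOffsets("demo-topic", "demo-group")

    val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, handler)

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // 1) process the batch first (e.g. write results to HBase) ...
      // 2) ... then commit the offsets. If the job dies between 1) and 2), that one
      //    batch is re-read on restart -- the small duplication mentioned above.
      zkSaveOffsets("demo-group", ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```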

But there is a more serious problem: when many consumers are consuming data, offsets have to be read and written constantly. ZooKeeper, as a distributed coordination framework, is not suited to large volumes of reads and writes, especially writes. So ZooKeeper cannot handle high-concurrency requests; it can only serve as a lightweight metadata store, not as a data store responsible for high-concurrency reads and writes.

For the reasons above, SparkStreaming moved on to integrating with Kafka-1.0.

SparkStreaming integration with Kafka-1.0

[Figure: SparkStreaming + Kafka-1.0 with offsets stored in Kafka itself]
Storing the offsets directly in Kafka avoids the risks of keeping them in ZooKeeper. There is one more thing to note here: Kafka can commit offsets automatically, but that can cause data loss.

Auto-commit works on a fixed interval, for example committing the offset every 2 seconds. But if I have just pulled a batch of data and have not had time to process it when the 2 seconds are up, the offset gets committed anyway; if the program then fails, that data is lost. So we usually commit the offsets manually!
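
A minimal sketch of this pattern with the spark-streaming-kafka-0-10 connector: auto-commit is turned off, and offsets are committed with commitAsync only after the batch has been processed. The broker, topic, and group names are illustrative:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object Kafka010ManualCommit {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("Kafka010ManualCommit"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "demo-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)  // manual commit, as argued above
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(record => record.value).count()          // placeholder for the real processing
      // Commit only after processing succeeded; Kafka itself stores the offsets.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```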

How do you design a monitoring and alerting scheme?

In daily development work we need a monitoring scheme for real-time jobs. If a real-time job is not monitored, the program is effectively running naked: you cannot tell whether the job is lagging behind, and that is a very scary situation!

[Figure: monitoring and alerting pipeline based on KafkaOffsetMonitor]
This is just one possible design, built on KafkaOffsetMonitor: use it to monitor the jobs, use a crawler to scrape the monitoring information, import that data into open-falcon, configure alerts in open-falcon according to your strategy (or build your own alerting system), and finally push the alerts to developers through enterprise WeChat or SMS!
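
As a rough illustration of what such a monitor checks, here is a sketch that computes consumer lag (log-end offset minus committed offset) per partition with a plain KafkaConsumer. The threshold and the alert sink are placeholders for whatever open-falcon/WeChat/SMS integration you build; the broker, topic, and group names are illustrative:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object LagCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "demo-group")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    val partitions = consumer.partitionsFor("demo-topic").asScala
      .map(p => new TopicPartition(p.topic, p.partition))

    val endOffsets = consumer.endOffsets(partitions.asJava).asScala
    partitions.foreach { tp =>
      val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
      val lag = endOffsets(tp).longValue() - committed
      if (lag > 10000) println(s"ALERT: $tp lag=$lag")   // push to your alerting system here
      else println(s"$tp lag=$lag")
    }
    consumer.close()
  }
}
```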

Summary

Alright! This article mainly explained how SparkStreaming integrates with Kafka. Lao Liu put a lot of thought into it and covered a lot of details. Friends who are interested in big data, remember to give him a like and a follow. Finally, if you have any questions, contact him through the official account "Lao Liu who works hard" for a pleasant exchange!

Origin: blog.csdn.net/qq_36780184/article/details/112234389