Let’s talk about Kappa architecture

Analysis & Answer

For real-time data warehouses, the Lambda architecture has two obvious shortcomings. First, maintaining two systems (batch and streaming) at the same time consumes a lot of resources. Second, both systems implement the same data processing logic, so the code has to be developed twice.

Is there an architecture that needs only one system to handle both stream processing and batch processing? Yes: the Kappa architecture.

Kappa architecture

The Kappa architecture is a genuinely unified stream-batch processing approach. It is a real-time data warehouse architecture proposed at LinkedIn (by Jay Kreps) as stream processing engines gradually matured.

[Figure: Kappa architecture]

This architecture is equivalent to removing the batch layer (Batch Layer) from the Lambda architecture, leaving only the stream processing layer (Speed Layer). Replay (backtracking) of upstream data is provided by the data retention function of the message queue.

When the code of a streaming job changes, or when historical data must be recomputed, the original Job N keeps running unchanged. A new job, Job N+1, is started first; it reads the historical data from the message queue, performs the calculations, and stores the results in a new data table.

When Job N+1 catches up with the progress of Job N, it replaces Job N as the current stream processing job. The application then switches to reading from the new data table, the old Job N is stopped, and the old data table is deleted.

Of course, this architecture can be optimized by merging the two output tables into one, which reduces operations and maintenance work.
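The replay-and-switch procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the list `log` stands in for a message queue with retained history, and `StreamJob`, `job_v1`, and `job_v2` are hypothetical names, not part of any real framework.

```python
from collections import defaultdict

# Hypothetical stand-in for a message queue topic that retains its history.
log = [("user_a", 3), ("user_b", 5), ("user_a", 4)]


def job_v1(event):
    """Logic of Job N: count events per key."""
    key, _amount = event
    return key, 1


def job_v2(event):
    """Logic of Job N+1 (changed code): sum amounts per key."""
    key, amount = event
    return key, amount


class StreamJob:
    """Toy streaming job with its own read offset and output table."""

    def __init__(self, logic):
        self.logic = logic
        self.offset = 0                 # next log position to read
        self.table = defaultdict(int)   # this job's own output table

    def poll(self, source):
        """Process every event not yet consumed and return how many were read."""
        processed = 0
        while self.offset < len(source):
            key, value = self.logic(source[self.offset])
            self.table[key] += value
            self.offset += 1
            processed += 1
        return processed


# Job N is live and the application reads from its table.
job_n = StreamJob(job_v1)
job_n.poll(log)
serving = job_n

# The logic changes: Job N+1 starts from offset 0 while Job N keeps running.
job_n1 = StreamJob(job_v2)
log.append(("user_b", 1))   # new data keeps arriving during the replay
job_n.poll(log)
job_n1.poll(log)

# Once Job N+1 has caught up, switch reads to it and retire Job N.
if job_n1.offset >= job_n.offset:
    serving = job_n1
    job_n.table.clear()     # "delete the old data table"

print(dict(serving.table))
```

The key design point is that both jobs read the same retained log independently, so the old results stay available until the new table is complete.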

Compared with the Lambda architecture, the Kappa architecture has lower throughput and performance, because in Lambda the batch layer is what carries most of the throughput and heavy processing.

However, Kappa unifies the data processing architecture, reduces wasted computing resources, and lowers operation and maintenance costs. The code also only needs to be written and maintained once. What Kappa cannot solve is the inconsistency that exists in some processing logic between stream processing and batch processing.

Kappa architecture selection

When building a Kappa architecture, Kafka is often chosen as the message queue because it can retain and replay historical data and supports multiple consumers.
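To make "retain, replay, and multiple consumers" concrete, here is a toy in-memory model of a retained log with one read offset per consumer group. It is a sketch of the idea only, not Kafka's client API; `RetainedLog` and its methods are hypothetical names chosen for illustration.

```python
class RetainedLog:
    """Toy append-only log, roughly a stand-in for one topic partition."""

    def __init__(self):
        self._records = []
        self._committed = {}   # consumer group -> next offset to read

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, group, max_records=10):
        """Return the next batch for this group and advance its offset."""
        start = self._committed.get(group, 0)
        batch = self._records[start:start + max_records]
        self._committed[group] = start + len(batch)
        return batch

    def seek_to_beginning(self, group):
        """Rewind one group so it replays the retained history."""
        self._committed[group] = 0


topic = RetainedLog()
for record in ["order-1", "order-2", "order-3"]:
    topic.append(record)

first = topic.read("dashboard")              # one consumer group reads everything
audit = topic.read("audit", max_records=2)   # another group reads at its own pace

topic.seek_to_beginning("dashboard")         # replay: records were never deleted
replayed = topic.read("dashboard")
```

Reading does not remove records, which is what lets a new job re-read history while existing consumers continue from their own positions.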

For the stream processing cluster, Flink is generally chosen because it supports unified stream-batch processing and its SQL support keeps improving, which minimizes inconsistencies between stream processing and batch processing code.

For the data serving layer, a database that supports real-time reads and writes is still required. Common choices include HBase, Druid, and ClickHouse.

However, when using Kafka as the message queue, note that by default Kafka writes messages to the operating system's page cache first and flushes them to disk asynchronously, so messages that have not yet been flushed or replicated can be lost if a broker fails.

If financial-level data reliability is required, a message queue such as RabbitMQ or RocketMQ, which supports persisting messages directly to disk, may be a better choice, but it sacrifices some latency and throughput.
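The benefit of writing the logic once can be illustrated without any framework. The sketch below is plain Python, not Flink's API: a single hypothetical function, `running_totals`, is applied both to a bounded list (a batch-style replay of history) and to a generator that yields events one at a time (a stream-style feed).

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, int]


def running_totals(events: Iterable[Event]) -> Iterator[Event]:
    """Emit the updated total for a key after each event."""
    totals = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
        yield key, totals[key]


# Bounded input: historical data processed in one pass.
historical = [("a", 1), ("b", 2), ("a", 3)]
batch_result = list(running_totals(historical))


def live_feed() -> Iterator[Event]:
    """Hypothetical unbounded source, truncated here so the example ends."""
    for event in historical:
        yield event


# Unbounded-style input: results are consumed one event at a time.
stream_result = []
for update in running_totals(live_feed()):
    stream_result.append(update)

print(batch_result == stream_result)
```

Because the same function serves both inputs, there is no second implementation that could drift out of sync with the first.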
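The Kafka data-loss risk mentioned above depends heavily on configuration. The sketch below is a simplified model, not a guarantee: it assumes unclean leader election is disabled and ignores the case where every replica crashes before flushing. `acks` and `enable.idempotence` are real Kafka producer settings and `min.insync.replicas` is a real topic setting, but the values and the helper `tolerated_broker_failures` are illustrative assumptions.

```python
from typing import Optional

# Illustrative values only; tune them for the actual cluster.
producer_config = {"acks": "all", "enable.idempotence": True}
topic_config = {"min.insync.replicas": 2}
replication_factor = 3


def tolerated_broker_failures(
    acks: str, replication_factor: int, min_insync_replicas: int
) -> Optional[int]:
    """Model how many broker failures an acknowledged write survives.

    Returns None when the producer never waits for an acknowledgement.
    """
    if acks == "0":
        return None   # fire-and-forget: nothing is ever acknowledged
    if acks == "1":
        return 0      # only the leader is known to hold the write
    if acks == "all":
        # An acknowledged write exists on at least this many replicas.
        copies = min(min_insync_replicas, replication_factor)
        return copies - 1
    raise ValueError(f"unsupported acks value: {acks!r}")


survives = tolerated_broker_failures(
    producer_config["acks"],
    replication_factor,
    topic_config["min.insync.replicas"],
)
print(survives)
```

Under this model, stricter acknowledgement settings trade some latency and throughput for durability, which is the same trade-off the text describes for disk-persistent queues.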

Reflect & Expand

Neither the Kappa architecture nor the Lambda architecture is inherently better than the other; they simply suit different scenarios.


Origin blog.csdn.net/jjclove/article/details/127407176