Author: Zen and the Art of Computer Programming
1 Introduction
Apache Flink and Apache Kafka are two well-known open source projects for building reliable, high-throughput, and low-latency data pipelines, and they are very commonly deployed together. In this pairing, Apache Kafka provides durable, replayable message storage, while Flink serves as a distributed stream processing engine for real-time computation and analysis. Kafka was designed with real-time handling of large-scale data in mind: it persists records in partitioned, append-only logs and serves producers and consumers over its own binary protocol on TCP. Flink complements this with a unified model for batch and stream processing, adding features such as windowing and stateful operators to support more complex real-time application scenarios. Therefore, the two can be effectively combined to build a powerful ecosystem.
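Concretely, the coupling between the two systems is loose: a Flink job reading from Kafka mainly needs to know the broker addresses and the consumer group it belongs to. The property names below are standard Kafka consumer settings; the host, group name, and topic are placeholder values for illustration only:

```properties
# Where to find the Kafka brokers (placeholder address)
bootstrap.servers=localhost:9092
# Consumer group shared by the parallel instances of the Flink source
group.id=flink-analytics
# Start from the beginning of the log if no committed offset exists
auto.offset.reset=earliest
```

Because offsets are tracked per consumer group, a Flink job can be stopped and restarted without the producers noticing; this decoupling is what makes Kafka a good buffer in front of a stream processor.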
In this article, I will explain the integration architecture between Apache Flink and Apache Kafka, and how to use them in practical applications. The main content of the article is as follows:
- Introduction to Apache Flink
- Introduction to Apache Kafka
- Overview of Apache Flink + Apache Kafka Integrated Architecture
- Publish-subscribe pattern for data sources
- A stateful mechanism for stream processing
- Configuration parameters and operation guide
- Data communication protocol between Apache Flink and Apache Kafka
- Data integration practice and experience summary
The article assumes that readers are already familiar with Apache Flink and Apache Kafka, and have some experience with them.
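Before diving into the integrated architecture, it is worth recalling what the publish-subscribe pattern (the second topic above) buys us. The following stdlib-only Java sketch is not Kafka or Flink code; it is a hypothetical in-memory broker that illustrates the essential idea: producers publish to a named topic without knowing who consumes it, and each subscriber receives every message on its own queue.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal in-memory publish-subscribe broker, illustrating the decoupling
// that Kafka provides between data producers and stream-processing consumers.
public class PubSubSketch {
    // One queue per subscriber per topic, so every subscriber sees every message.
    private final Map<String, List<BlockingQueue<String>>> topics = new ConcurrentHashMap<>();

    // Subscribe: returns a private queue that will receive future messages.
    public BlockingQueue<String> subscribe(String topic) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        topics.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(queue);
        return queue;
    }

    // Publish: fan the message out to every current subscriber of the topic.
    public void publish(String topic, String message) {
        for (BlockingQueue<String> queue : topics.getOrDefault(topic, List.of())) {
            queue.add(message);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PubSubSketch broker = new PubSubSketch();
        BlockingQueue<String> source = broker.subscribe("clicks");
        broker.publish("clicks", "user-1:page-a");
        broker.publish("clicks", "user-2:page-b");
        // A stream processor would poll its subscription in a loop.
        System.out.println(source.take()); // user-1:page-a
        System.out.println(source.take()); // user-2:page-b
    }
}
```

Real Kafka adds what this toy lacks: the log is persistent and partitioned, consumers track their own offsets so they can replay history, and subscribers in the same consumer group share partitions instead of each receiving every message.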