Flink Stream-Batch Integrated Computing (1): Overview of Stream-Batch Integrated and Flink

How Apache Flink came into being

The wave of the digital economic revolution is fundamentally changing the way people work and live. The digital economy plays an increasingly important role in global economic growth. Digital technologies represented by the Internet, cloud computing, big data, the Internet of Things, and artificial intelligence have developed rapidly in recent years, and their deep integration with traditional industries has released enormous energy and become a strong driving force for economic development.

Big data technology began to emerge in China around 2008, more than a decade ago. Over this period IT as a whole has also developed rapidly, and the rise of big data has undoubtedly helped drive that growth. As time goes by, more and more companies have raised their requirements for real-time processing: they want to minimize the delay between when data is generated and when it is fully processed, while coping with the complex problems that real-time processing brings, such as late-arriving data, state preservation, and complex event detection.

Apache Flink came into being in this context. It is a distributed open-source computing framework for both data stream processing and batch data processing. Built on a single streaming execution model, it supports both types of application: stream processing and batch processing.

Flink's approach to stream and batch processing is completely different from traditional solutions: it looks at the two from another perspective and unifies them. Flink fully supports stream processing, meaning the input data stream is unbounded; batch processing is treated as a special case of stream processing whose input data stream simply happens to be bounded.
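This unification can be illustrated with a small conceptual sketch (plain Python, not the Flink API): the same processing logic runs unchanged over a bounded input ("batch") and an unbounded one ("stream"), which is exactly the idea behind treating batch as bounded stream processing.

```python
import itertools
from typing import Iterable, Iterator

def running_sum(stream: Iterable[int]) -> Iterator[int]:
    """One processing routine: emit a running total per incoming record.
    It never needs to know whether `stream` is bounded or unbounded."""
    total = 0
    for record in stream:
        total += record
        yield total

# "Batch": a bounded input -- the stream simply ends, and so does the job.
batch_input = [1, 2, 3, 4]
print(list(running_sum(batch_input)))  # [1, 3, 6, 10]

# "Stream": an unbounded input -- results are produced continuously;
# here we only consume the first few of them.
unbounded = itertools.count(1)  # 1, 2, 3, ... forever
print(list(itertools.islice(running_sum(unbounded), 4)))  # [1, 3, 6, 10]
```

Note that the aggregation function is defined once; only the nature of the source differs, mirroring how a single Flink program can run in either execution mode.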

Why pursue stream-batch unification?

With a unified stream-batch computing engine, many benefits can be obtained along the data processing pipeline:

  • Lower learning cost: users no longer need to learn two computing engines; with a unified engine and the same computing semantics, the chance of errors is greatly reduced.
  • Lower resource consumption: under the original Lambda architecture, batch and stream processing pipelines run side by side; with stream-batch unification there is only one data processing pipeline. In addition, compared with batch computing, which must process large data sets within a short window, stream computing works on smaller incremental data sets, so the required computing resources are greatly reduced.
  • Lower architectural complexity: batch computing provides completeness while stream computing provides real-time results, and each connects to different upstream and downstream systems, which leads to an extremely complex data processing architecture. Unifying stream and batch, together with their upstream and downstream systems, simplifies the architecture; this not only makes the design cleaner but also unifies and stabilizes business processing.
  • Faster value delivery: replacing batch computing with stream computing turns formerly high-latency output into near-real-time output, which supports the business's value delivery more effectively.

Building a stream-batch integrated architecture based on Apache Flink

First, everything is developed with one set of Flink SQL, so there is no double development cost: one development team with one technology stack can handle all offline and real-time business statistics.

Second, there is no redundancy in the data pipeline: computing the detail layer once is enough, with no need to compute it again offline.

Third, the data caliber is naturally consistent. Whether the pipeline is offline or real-time, it is the same engine, the same SQL, the same UDFs, and the same developers, so the results are naturally consistent, and there is no discrepancy between real-time and offline data.
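As a rough illustration of this "one SQL, two modes" idea, consider the following hedged Flink SQL sketch. The table names (`dwd_orders`, `dws_order_stats`) and columns are hypothetical; the windowing syntax follows Flink's tumbling-window table-valued function, and the runtime mode is selected via Flink's `execution.runtime-mode` option rather than by rewriting the query.

```sql
-- Illustrative sketch only: table and column names are hypothetical.
-- The same query text serves both pipelines; only the runtime mode differs.

-- SET 'execution.runtime-mode' = 'streaming';  -- real-time pipeline
-- SET 'execution.runtime-mode' = 'batch';      -- offline backfill

INSERT INTO dws_order_stats
SELECT
  window_start,
  COUNT(*)    AS order_cnt,
  SUM(amount) AS order_amount
FROM TABLE(
  TUMBLE(TABLE dwd_orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR))
GROUP BY window_start;
```

Because the aggregation logic lives in one statement, any fix or UDF change applies to both the real-time and the offline results at once, which is where the "naturally consistent caliber" comes from.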

Origin blog.csdn.net/victory0508/article/details/131310092