Learning Flink from 0 to 1 - Chapter 1: Introduction to Flink

1.1 Getting to Know Flink

Flink originated from the Stratosphere project. Stratosphere was a research project conducted from 2010 to 2014 by three universities in Berlin together with several other European universities. In April 2014, the Stratosphere code was forked and donated to the Apache Software Foundation, and the initial members of the incubation project were the core developers of the Stratosphere system. In December 2014, Flink became a top-level project of the Apache Software Foundation.

In German, the word Flink means fast and agile. The project uses a squirrel's color pattern as its logo, not only because squirrels are fast and agile, but also because the squirrels in Berlin have a charming reddish-brown color. The squirrel logo has a cute tail whose color echoes that of the Apache Software Foundation's logo; in other words, it is an Apache-style squirrel.

The Flink project describes its vision as: "Apache Flink is an open source stream processing framework for distributed, high-performance, always-available, and accurate stream processing applications."

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.

1.2 Important features of Flink

1.2.1 Event-driven

An event-driven application is a stateful application that ingests events from one or more event streams and reacts to incoming events by triggering computations, state updates, or other external actions. Applications built on message queues such as Kafka are typical event-driven applications.

By contrast, Spark Streaming uses a micro-batch model, as shown in the figure:

[Figure: Spark Streaming micro-batch model]

Event-driven:

[Figure: event-driven model]
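To make the event-driven model concrete, here is a minimal sketch of a stateful Flink job in Java. It is illustrative only and not from the original post: the socket source on port 9999 and the per-key counter are assumptions chosen for the example (a Kafka source would be the more typical choice). Each incoming event immediately triggers a state update and an output record, with no micro-batching:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class EventDrivenSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical source: each line read from the socket is one event.
        env.socketTextStream("localhost", 9999)
           .map(line -> Tuple2.of(line, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(t -> t.f0)
           .flatMap(new CountPerKey())   // every event updates state and emits a result
           .print();

        env.execute("event-driven sketch");
    }

    // Stateful function: fault-tolerant keyed state is updated per incoming event.
    public static class CountPerKey
            extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> event,
                            Collector<Tuple2<String, Long>> out) throws Exception {
            long updated = (count.value() == null ? 0L : count.value()) + event.f1;
            count.update(updated);
            out.collect(Tuple2.of(event.f0, updated));
        }
    }
}
```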

1.2.2 The worldview of streams and batches

Batch processing is characterized by bounded, persistent, large-scale data. It is well suited to computing tasks that need access to a complete set of records, and it is generally used for offline statistics.

Stream processing is characterized by unbounded, real-time data. It does not operate on an entire data set but on each data item as it passes through the system, and it is generally used for real-time statistics.

In Spark's worldview, everything is composed of batches: offline data is one large batch, and real-time data is an infinite sequence of small batches.

In Flink's worldview, everything is composed of streams: offline data is a bounded stream, and real-time data is an unbounded stream. These are the so-called bounded and unbounded streams.

Unbounded data streams: an unbounded stream has a start but no end. It does not terminate; it delivers data as the data is generated. Unbounded streams must be processed continuously, that is, events must be handled promptly after they arrive. We cannot wait for all the data to arrive, because the input is unbounded and will never be complete at any point in time. Processing unbounded data usually requires events to be ingested in a specific order (such as the order in which they occurred), so that the completeness of the results can be reasoned about.

Bounded data streams: a bounded stream has a clearly defined start and end. A bounded stream can be processed by ingesting all of the data before performing any computation. Processing a bounded stream does not require ordered ingestion, because a bounded data set can always be sorted; the processing of bounded streams is also known as batch processing.
[Figure: bounded and unbounded data streams]

The biggest benefit of this stream-based worldview is extremely low latency.
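As an illustration of this unified worldview, the same DataStream API can consume both kinds of streams. The sketch below uses placeholder values (the file path and socket address are assumptions for the example): the file source is bounded, so a job reading it finishes once all records are processed, while the socket source is unbounded and runs continuously:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded stream: the file has a defined start and end; batch processing
        // is just the special case of a stream that eventually terminates.
        DataStream<String> bounded = env.readTextFile("file:///tmp/input.txt");

        // Unbounded stream: the socket keeps delivering data; the job runs
        // continuously and must process each event as it arrives.
        DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

        bounded.print("bounded");
        unbounded.print("unbounded");

        env.execute("bounded vs unbounded sketch");
    }
}
```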

1.2.3 Hierarchical API

[Figure: Flink's layered API abstractions]

The lowest level of abstraction simply provides stateful streams, which are embedded into the DataStream API via the Process Function. The Process Function is integrated with the DataStream API and makes this low-level abstraction available for specific operations: it allows users to freely process events from one or more streams with consistent, fault-tolerant state. In addition, users can register event-time and processing-time callbacks, enabling programs to implement complex computations.
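A minimal sketch of what this lowest layer looks like (illustrative only; the 60-second timeout and the output strings are invented for the example): a KeyedProcessFunction reacts to each individual event and registers a processing-time callback via the timer service:

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch of the lowest-level abstraction: process events one by one and
// register time callbacks, with access to fault-tolerant keyed state.
public class TimeoutDetector extends KeyedProcessFunction<String, String, String> {

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) {
        // React to each event and schedule a callback 60 seconds from now.
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000L);
        out.collect("seen: " + event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // The registered callback fires here.
        out.collect("timer fired for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}
```

It would be attached to a keyed stream with something like `stream.keyBy(...).process(new TimeoutDetector())`.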

In practice, most applications do not need the low-level abstraction described above and instead program against the Core APIs, such as the DataStream API (for bounded or unbounded streams) and the DataSet API (for bounded data sets). These APIs provide common building blocks for data processing, such as various user-defined transformations, joins, aggregations, window operations, and so on. The DataSet API offers additional primitives for bounded data sets, such as loops and iterations. The data types handled by these APIs are represented as classes in the respective programming languages.
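A typical Core API program chains exactly these building blocks. The sketch below is illustrative (the socket source and the 5-second tumbling window are assumptions chosen for the example): a transformation (flatMap), a keyed aggregation, and a window operation:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class CoreApiWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // transformation: split each line into (word, 1) pairs
           .flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
               for (String word : line.split("\\s")) {
                   out.collect(Tuple2.of(word, 1L));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           // keyed aggregation over a 5-second tumbling window
           .keyBy(t -> t.f0)
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           .sum(1)
           .print();

        env.execute("core API word count");
    }
}
```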

The Table API is declarative, table-centric programming, where a table may change dynamically (when it represents streaming data). The Table API follows the (extended) relational model: tables have a schema (similar to tables in relational databases), and the API offers comparable operations such as select, project, join, group-by, aggregate, and so on. A Table API program declaratively defines what logical operations should be performed, rather than specifying exactly what the code for those operations looks like.

Although the Table API can be extended with various user-defined functions (UDFs), it is less expressive than the Core APIs, but more concise to use (less code to write). In addition, Table API programs are optimized by a built-in optimizer before execution.

You can switch seamlessly between tables and DataStream/DataSet, allowing programs to mix the Table API with the DataStream and DataSet APIs. The highest-level abstraction Flink offers is SQL. This layer is similar to the Table API in both syntax and expressiveness, but represents programs as SQL query expressions. The SQL abstraction interacts closely with the Table API, and SQL queries can be executed directly on tables defined with the Table API.
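The sketch below illustrates that interplay (assuming a relatively recent Flink version, roughly 1.13 or later, for toChangelogStream; the view name and sample data are made up): a DataStream becomes a Table, the Table is queried with SQL, and the result is converted back to a stream:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class TableAndSqlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Convert a DataStream into a Table and register it as a view.
        DataStream<String> words = env.fromElements("flink", "spark", "flink");
        Table table = tableEnv.fromDataStream(words).as("word");
        tableEnv.createTemporaryView("Words", table);

        // SQL query executed directly on the table defined via the Table API.
        Table result = tableEnv.sqlQuery(
                "SELECT word, COUNT(*) AS cnt FROM Words GROUP BY word");

        // Back to a DataStream (a changelog stream, since the counts update).
        tableEnv.toChangelogStream(result).print();
        env.execute("table and sql sketch");
    }
}
```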

At present, Flink is not yet mainstream for batch processing and is less mature there than Spark, so the DataSet API is not used much. The Flink Table API and Flink SQL are also not yet complete, and most production versions are customized by the major companies. So we will mainly study the use of the DataStream API. In fact, Flink, as the closest implementation of the Google Dataflow model, provides a unified view of streams and batches, so the DataStream API is basically sufficient.

Several Flink modules

  • Flink Table & SQL (still under development)
  • Flink Gelly (graph processing)
  • Flink CEP (complex event processing)

Source: blog.csdn.net/dwjf321/article/details/109056568