Big Data - Playing with Data - Getting to Know Flink for the First Time

1. Getting to know Flink

Flink uses a colorful squirrel as its logo. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments, performing computations at in-memory speed and at any scale.

2. Important features of Flink

1. Event-driven

An event-driven application is a class of stateful applications that ingests data from one or more event streams and triggers computations, state updates, or other external actions based on incoming events. Systems built around message queues, typified by Kafka, are almost all event-driven, and Flink's computation model is event-driven as well. SparkStreaming's micro-batch model, by contrast, is not.

2. The world views of streams (Flink) and batches (Spark)

Batch processing deals with data that is bounded but potentially very large. It is well suited to computations that need access to a complete set of records, and is generally used for offline statistics.
Stream processing deals with data that is unbounded and arrives in real time. Instead of operating on an entire data set, it operates on each item as it flows through the system, and is generally used for real-time statistics.
In Spark's world view, everything is composed of batches: offline data is one large batch, and real-time data is an endless series of small batches.
In Flink's world view, everything is composed of streams: offline data is a bounded stream, and real-time data is an unbounded stream. These are the so-called bounded and unbounded streams.
Unbounded data streams:
An unbounded data stream has a beginning but no end; it does not terminate and keeps producing data as events are generated. Unbounded streams must be processed continuously, that is, each event must be handled as soon as it is received: we cannot wait for all the data to arrive, because the input never completes at any point in time. Processing unbounded data often requires ingesting events in a specific order (such as the order in which they occurred) so that the completeness of the results can be reasoned about.
Bounded data streams:
A bounded data stream has a well-defined beginning and end. Bounded streams can be processed by ingesting all of the data before performing any computation. Ordered ingestion is not required, because a bounded data set can always be sorted after the fact. Processing of bounded streams is also known as batch processing.
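To make the distinction concrete, here is a minimal Java DataStream sketch (not from the original article) contrasting the two kinds of sources; the socket host and port are placeholders for any continuously producing source.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded stream: a fixed data set with a known beginning and end.
        env.fromElements("a", "b", "c").print("bounded");

        // Unbounded stream: events keep arriving until the job is cancelled.
        // "localhost:9999" is a placeholder for any continuously producing source.
        env.socketTextStream("localhost", 9999).print("unbounded");

        env.execute("bounded-vs-unbounded");
    }
}
```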

3. Layered API

The lowest level of abstraction offers stateful streams and is embedded in the DataStream API through process functions (ProcessFunction). The process function is integrated with the DataStream API and provides a low-level abstraction for specific operations: it lets users freely process events from one or more data streams while using consistent, fault-tolerant state. Among other things, users can register event-time and processing-time callbacks, which allows programs to implement complex computations.
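As an illustration of this lowest layer, here is a minimal, hypothetical KeyedProcessFunction sketch that keeps per-key state and registers a processing-time timer; the key type, the 60-second timeout, and the output format are all choices made up for the example.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts events per key in fault-tolerant state and fires a timer callback.
public class CountWithTimeout extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long count = countState.value();
        countState.update(count == null ? 1L : count + 1);
        // Register a processing-time callback 60 s from now (arbitrary choice).
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Emit the current count for this key when the timer fires.
        out.collect(ctx.getCurrentKey() + " -> " + countState.value());
    }
}
```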

In practice, most applications do not need this low-level abstraction and instead program against the core APIs, such as the DataStream API (for bounded or unbounded streams) and the DataSet API (for bounded data sets). These APIs provide common building blocks for data processing, such as user-defined transformations, joins, aggregations, windows, and more. The DataSet API additionally supports constructs for bounded data sets, such as loops and iterations. The data types handled by these APIs are represented as classes in their respective programming languages.
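For a feel of the core APIs, here is a minimal windowed word-count sketch against the DataStream API; the socket source and the 10-second window size are assumptions made for the example.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // Split each line into (word, 1) pairs
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           // Aggregate counts in 10-second tumbling windows
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .sum(1)
           .print();

        env.execute("window-word-count");
    }
}
```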

The Table API is declarative, table-centric programming, where tables may change dynamically (when expressing streaming data). The Table API follows the (extended) relational model: a table has a schema (similar to a table in a relational database), and the API offers comparable operations such as select, project, join, group-by, and aggregate. Table API programs declaratively define what logical operations should be performed, rather than spelling out exactly how the code for those operations should look.

Although the Table API can be extended with various kinds of user-defined functions (UDFs), it is less expressive than the core APIs, but it is more concise to use (less code). In addition, Table API programs are optimized by a built-in optimizer before execution.
You can switch seamlessly between Table and DataStream/DataSet, which allows programs to mix the Table API with the DataStream and DataSet APIs, as the sketch below illustrates.
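A minimal sketch of that mixing, assuming the flink-table-api-java-bridge dependency is on the classpath; the user/amount columns and the sample rows are invented for the example. The conversion back from the aggregated Table produces a retract stream, since a streaming aggregate keeps updating its results.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import static org.apache.flink.table.api.Expressions.$;

public class TableApiMixExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        DataStream<Tuple2<String, Integer>> orders = env.fromElements(
                Tuple2.of("user-1", 10), Tuple2.of("user-2", 25), Tuple2.of("user-1", 5));

        // DataStream -> Table, naming the columns
        Table ordersTable = tableEnv.fromDataStream(orders, $("user"), $("amount"));

        // Declarative, relational-style operations
        Table totals = ordersTable
                .groupBy($("user"))
                .select($("user"), $("amount").sum().as("total"));

        // Table -> DataStream; a retract stream, because the aggregate updates its results
        tableEnv.toRetractStream(totals, Row.class).print();

        env.execute("table-api-mix");
    }
}
```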

The highest level of abstraction provided by Flink is SQL. This level is similar to the Table API in syntax and expressiveness, but represents programs as SQL query expressions. The SQL abstraction interacts closely with the Table API, and SQL queries can be executed directly on tables defined with the Table API.
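A minimal SQL sketch under the same assumptions; the inline VALUES clause stands in for a real source such as Kafka or a file system.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class SqlExample {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // An inline VALUES table stands in for a real source (Kafka, files, ...)
        Table result = tableEnv.sqlQuery(
                "SELECT word, COUNT(*) AS cnt "
              + "FROM (VALUES ('hello'), ('flink'), ('hello')) AS t(word) "
              + "GROUP BY word");

        // Prints a changelog: counts are updated as (virtual) rows arrive
        result.execute().print();
    }
}
```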
At present, Flink is not yet mainstream for batch processing and is less mature there than Spark, so the DataSet API is not used much. The Flink Table API and Flink SQL are also not yet complete, and are mostly customized in-house by large vendors. So we mainly study the DataStream API. In fact, since Flink is the closest implementation of the Google Dataflow model, which treats streams and batches under one unified view, the DataStream API covers essentially everything.

Version 1.12.0, released on December 8, 2020, achieved true unification of streaming and batch processing: one set of code can process both streaming data and offline data. This differs from the earlier approach of simply treating batch data as a bounded stream, and Flink applies dedicated optimizations for batch data.
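A minimal sketch of this unification (the setting is available since Flink 1.12): the execution mode is set once, and the same DataStream code runs as either a streaming or a batch job.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING (default), BATCH, or AUTOMATIC; AUTOMATIC picks batch
        // execution when all sources are bounded, streaming otherwise.
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        env.fromElements(1, 2, 3, 4)
           .map(n -> n * n)
           .returns(Types.INT)
           .print();

        env.execute("unified-mode");
    }
}
```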

3. Spark or Flink

The main difference between Spark and Flink lies in their computing models.
Spark uses a micro-batch model, while Flink uses an operator-based continuous-stream model. Choosing between Apache Spark and Apache Flink is therefore really a choice of computing model, and that choice involves trade-offs across latency, throughput, reliability, and other dimensions.
If an enterprise must pick one of the two mainstream frameworks, Spark or Flink, for processing stream data, we recommend Flink. The main (and obvious) reasons are:
- Flink's flexible windows
- Exactly-once semantic guarantees
- Event-time semantics (handling out-of-order or delayed data)
These features greatly relieve programmers and speed up development, handing work that would otherwise cost a great deal of manual effort over to the framework; the event-time sketch below illustrates the last two points.
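A minimal sketch (not from the original article): event timestamps are taken from the records, a watermark strategy tolerates events arriving up to 5 seconds out of order, and counts are computed in event-time windows; the sensor data, window size, and lateness bound are invented for the example.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, event timestamp in epoch millis); the third element is out of order
        env.fromElements(
                Tuple2.of("sensor-1", 1_000L),
                Tuple2.of("sensor-1", 4_000L),
                Tuple2.of("sensor-1", 2_000L))
           // Tolerate events that are up to 5 s late relative to the watermark
           .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.f1))
           .map(e -> Tuple2.of(e.f0, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           // Count events per key in 5-second event-time windows
           .window(TumblingEventTimeWindows.of(Time.seconds(5)))
           .sum(1)
           .print();

        env.execute("event-time-windows");
    }
}
```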

4. Applications of Flink

1. Scenarios for applying Flink

Event-Driven Applications
As described above, an event-driven application is a stateful application that ingests events from one or more streams and reacts to them with computations, state updates, or external actions.
Event-driven applications evolved from the traditional design in which computation and data storage are separated. In that traditional architecture, applications read and write their data in a remote transactional database.
In contrast, event-driven applications are built on stateful stream processing. In this design, data and computation are co-located, and the application only needs local access (memory or disk) to obtain its data. Fault tolerance is achieved by periodically writing checkpoints to remote persistent storage.

Event-driven applications do not need to query a remote database; local data access gives them higher throughput and lower latency. And because the periodic checkpoints to remote persistent storage can be written asynchronously and incrementally, they have little impact on normal event processing.

The benefits go beyond local data access. In a traditional layered architecture, multiple applications often share the same database, so any change to the database itself (for example, a change in data layout caused by an application update, or scaling a service) must be carefully coordinated. Because each event-driven application is responsible only for its own data, changing the data representation or scaling the service requires far less coordination.
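A minimal sketch of the fault-tolerance setup described above; the checkpoint interval, minimum pause, and HDFS path are placeholder choices, and FsStateBackend is one of several state backends available in the 1.12-era API.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 60 s with exactly-once guarantees
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);

        // Local state lives with the computation; snapshots go to durable
        // remote storage. The HDFS path is a placeholder for this sketch.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // ... build the stateful, event-driven pipeline here ...
        env.fromElements("event-a", "event-b").print();

        env.execute("checkpointed-app");
    }
}
```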

Data Analytics Applications
Apache Flink supports both streaming and batch analysis applications.
Flink offers good support for continuous streaming analytics as well as batch analytics. Specifically, it has a built-in, ANSI-compliant SQL interface that unifies the semantics of batch and streaming queries: the same SQL query yields consistent results whether it runs over a static data set of recorded events or over a real-time event stream. Flink also supports rich user-defined functions, allowing custom code to run inside SQL. If the logic needs further customization, the Flink DataStream API and DataSet API give lower-level control. In addition, Flink's Gelly library provides algorithms and building blocks for large-scale, high-performance graph analytics on batch data sets.

Data Pipeline Applications
Extract-Transform-Load (ETL) is a common method for transforming and migrating data between storage systems. ETL jobs are typically triggered periodically to copy data from a transactional database to an analytical database or data warehouse.
Data pipelines serve a similar purpose to ETL jobs: they transform, enrich, and move data from one storage system to another. But a data pipeline runs in continuous streaming mode rather than being triggered periodically, so it can read records from a source that produces data continuously and move them to their destination with low latency. For example, a data pipeline could monitor a file system directory for new files and write their contents to an event log; another application might materialize an event stream into a database, or incrementally build and refine a search index.
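A minimal sketch of the file-monitoring pipeline just mentioned, using the DataStream API's readFile source in continuous mode; the directory, the 10-second scan interval, and the toy transformation are placeholders.

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class FileMonitorPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String dir = "/tmp/incoming";  // placeholder directory for this sketch

        // Re-scan the directory every 10 s and stream the contents of new files
        env.readFile(new TextInputFormat(new Path(dir)), dir,
                     FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000)
           .map(String::toUpperCase)   // stand-in for a real transform/enrichment step
           .print();                   // stand-in for writing to a sink (Kafka, JDBC, ...)

        env.execute("file-monitor-pipeline");
    }
}
```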

2. Industries where Flink is applied

- E-commerce and marketing: data reports, advertising
- Internet of Things (IoT): real-time collection and display of sensor data, real-time alerts, transportation
- Logistics, distribution, and service industries: real-time order-status updates, push notifications
- Telecommunications: base-station traffic allocation
- Banking and finance: real-time settlement and notification push, real-time detection of abnormal behavior

3. Enterprises using Flink



Source: blog.csdn.net/s_unbo/article/details/130444121