Flink Unified Stream and Batch Computing (2): Key Features of Flink

Table of contents

Key features of Flink

Streaming

Rich state management

Rich temporal semantics support

Data pipelines

Fault tolerance mechanism

Flink SQL

CEP in SQL


Flink applications can consume real-time data from streaming sources such as message queues or distributed logs (for example, Apache Kafka or Kinesis), as well as bounded historical data from a variety of data sources. Likewise, the result streams produced by Flink applications can be sent to a wide variety of sinks.

Apache Flink is an excellent choice for developing and running many different types of applications thanks to its broad feature set. Flink's features include support for stream and batch processing, sophisticated state management, event-time processing semantics, and exactly-once consistency guarantees. Flink can be deployed on various resource management platforms such as YARN, Mesos, and Kubernetes, or run as a standalone cluster. Flink is highly available with no single point of failure, and has been proven to scale to thousands of cores and terabytes of application state while delivering high throughput and low latency.

Key features of Flink

  • Streaming

Data is generated naturally as streams. Whether it is event data from a web server, trade data from a stock exchange, or sensor data from a machine on a factory floor, the data arrives as a stream. Flink is a high-throughput, high-performance, low-latency real-time stream processing engine that provides millisecond-level processing latency.

An unbounded stream has a beginning but no end: it is an infinite data stream, and the program must continuously process data as it arrives. A bounded stream is a data set of limited size with both a beginning and an end: a finite data stream. Batch processing is an example of bounded stream processing, where the entire data set can be sorted, counted, or aggregated before the results are output.

The biggest benefit of this streams-first architecture is extremely low latency.
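
Because Flink treats batch as a special case of streaming, the same DataStream program can run over a bounded or an unbounded source. Below is a minimal sketch, assuming Flink 1.12+ (where RuntimeExecutionMode was introduced); the class name and tiny in-memory source are illustrative.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // With a bounded source, BATCH mode lets Flink process the whole data set
        // before emitting results; STREAMING mode emits results as records arrive.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("a", "b", "a", "c")   // a tiny bounded source for illustration
           .map(String::toUpperCase)
           .print();

        env.execute("bounded-vs-unbounded");
    }
}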

  • Rich state management

Flink is a stateful stream processor: Flink operators are stateful. How an event is handled may depend on the accumulated result of all the events that preceded it, so stream processing applications need to store received events or intermediate results for some period of time so that they can be accessed and processed later.

State is maintained by a task: all the data used to compute a given result belongs to the state of that task. State can be thought of as a local variable that the task's business logic can access.

Flink takes care of state management, including state consistency, fault handling, and efficient storage and access, so that developers can focus on application logic. Flink offers rich state management: a variety of basic state types, a choice of state backends (state can be kept in memory or in RocksDB, among others), an asynchronous and incremental checkpoint mechanism, exactly-once semantics, and more.

State access in a Flink application is local, which helps improve throughput and reduce latency. Typically, a Flink application keeps its state on the JVM heap, but if the state is too large it can instead be stored in a structured format on high-speed local disk.
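
As an illustration, the sketch below keeps a running count per key in Flink-managed ValueState; the stream contents and class name are made up for the example.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the count survives between events because it lives
// in Flink-managed keyed state, which is checkpointed and restored automatically.
public class CountPerKey extends RichFlatMapFunction<String, Long> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String event, Collector<Long> out) throws Exception {
        Long current = count.value();            // null on the first event for a key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(updated);
    }
}
// usage: events.keyBy(e -> e).flatMap(new CountPerKey())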

  • Rich temporal semantics support

Time is a central ingredient of stream processing applications. For real-time stream processing, time-based operations such as window aggregation, detection, and pattern matching are very common, and Flink provides rich time-semantics support; a minimal sketch follows the list below.

    • Event time: calculations use the timestamp carried by the event itself, which makes it easier to handle events that arrive out of order or late.
    • Ingestion time: the time at which data enters Flink.
    • Processing time: the local system clock of each operator performing a time-based operation; it is machine-dependent and is the default time attribute.
    • Watermark support: Flink introduces watermarks to measure the progress of event time. Watermarks also provide a flexible way to balance processing latency against data completeness. For data that still arrives after a result has been emitted, Flink offers several options, such as redirecting it to a side output or updating the previously emitted result.
    • Highly flexible streaming window support: Flink supports time windows, count windows, session windows, and data-driven custom windows, whose trigger conditions can be customized to implement complex streaming computation patterns.
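
To make the event-time and watermark bullets concrete, here is a minimal sketch using Flink's WatermarkStrategy API (available since Flink 1.12); the ClickEvent type, its getters, its numeric count field, and the five-second out-of-orderness bound are assumptions made for the example.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowExample {
    // Counts clicks per user in 1-minute event-time windows, tolerating
    // events that arrive up to 5 seconds out of order.
    public static DataStream<ClickEvent> countPerUser(DataStream<ClickEvent> clicks) {
        return clicks
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.getClickTime()))
            .keyBy(event -> event.getUserId())
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum("count");   // assumes ClickEvent has a numeric 'count' field
    }
}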
  • Data pipelines

ETL (extract, transform, load) is a common approach for converting and moving data between storage systems. ETL jobs are usually triggered periodically to copy data from transactional business systems into an analytical database or data warehouse.

A data pipeline plays a similar role to an ETL job: it transforms and moves data from one storage system to another. Unlike ETL jobs, however, data pipelines are not triggered periodically; they process data as a continuous stream. They can therefore read records from sources that produce data continuously and move them to their destination with low latency.

The biggest advantage of a continuous data pipeline over a periodic ETL job is lower latency. Data pipelines are also more general and fit more scenarios, because they can consume and emit data continuously.
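
A continuous pipeline in Flink might look like the following sketch, which tails a Kafka topic and continuously forwards transformed records downstream; the broker address, topic name, and the presence of Flink's Kafka connector (Flink 1.14+ KafkaSource API) are assumptions for illustration.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ContinuousPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously read from Kafka instead of running on a schedule.
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")           // assumed address
            .setTopics("events")                          // assumed topic
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
           .map(String::toUpperCase)   // stand-in for the real transformation
           .print();                   // stand-in for a real sink (JDBC, Elasticsearch, ...)

        env.execute("continuous-pipeline");
    }
}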

  • Fault tolerance mechanism

In a distributed system, the crash or failure of a single task or node often causes the entire job to fail. Flink provides a task-level fault tolerance mechanism that guarantees no user data is lost when an exception occurs and that the job can recover automatically.

Flink's checkpointing and failure recovery ensure that a job's application state is consistent before and after a failure, and for certain storage systems Flink supports transactional output, guaranteeing exactly-once output even in the event of a failure.

Flink implements fault tolerance with checkpoints. Users can configure a checkpoint policy for a job; when a task fails, the job is restored to the state of the latest checkpoint, and data received after that snapshot is replayed from the source, that is, the source is reset and consumption resumes from the offset recorded in the state. State snapshots are taken and stored asynchronously while the job is running, without blocking the ongoing data processing.
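
A minimal checkpoint configuration might look like this; the one-minute interval, pause, and timeout values are illustrative choices, not defaults.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot application state every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Leave breathing room between checkpoints and bound how long one may take.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        env.getCheckpointConfig().setCheckpointTimeout(120_000);
    }
}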

Savepoint: a savepoint is a consistent snapshot of the application state. The savepoint mechanism is similar to checkpointing, but savepoints must be triggered manually (typically with the flink savepoint CLI command). A savepoint ensures that the state of a running streaming application is not lost during an upgrade or migration, making it easy to pause and resume a job at any point in time.

  • Flink SQL

The Table API and SQL rely on Apache Calcite for query parsing, validation, and optimization. They integrate seamlessly with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions, which simplifies the definition of data analytics, ETL, and other applications. The following example shows how a Flink SQL statement counts clicks per user session.

SELECT userId, COUNT(*) 
FROM clicks 
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
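
In a Java program, such a query might be executed through the Table API as in the sketch below, which assumes a clicks table with userId and clicktime columns has already been registered in the catalog (for example via an earlier CREATE TABLE statement).

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SessionClicks {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Run the session-window aggregation and print the result stream.
        tableEnv.executeSql(
            "SELECT userId, COUNT(*) " +
            "FROM clicks " +
            "GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId"
        ).print();
    }
}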
  • CEP in SQL

Flink allows users to express CEP (Complex Event Processing) queries in SQL for pattern matching, evaluating event streams on Flink.

CEP in SQL is implemented through the MATCH_RECOGNIZE clause. MATCH_RECOGNIZE has been supported in Oracle SQL since Oracle Database 12c and is used to express event pattern matching in SQL. An example of CEP SQL usage follows:

SELECT T.aid, T.bid, T.cid
FROM MyTable
    MATCH_RECOGNIZE (
      PARTITION BY userid
      ORDER BY proctime
      MEASURES
        A.id AS aid,
        B.id AS bid,
        C.id AS cid
      PATTERN (A B C)
      DEFINE
        A AS name = 'a',
        B AS name = 'b',
        C AS name = 'c') AS T
