[Flink] An Introduction to the Flink Real-Time Processing Framework

Table of Contents

Architecture
Applications
Streams
State
Time
Layered API
Operations and Maintenance


Architecture

Flink is a distributed stream processing engine for stateful computations over bounded and unbounded data streams. It can be deployed on common distributed clusters and performs computations over massive data at in-memory speed.

Unbounded data streams: the data has a defined start but no defined end. Events are produced continuously and must therefore be processed continuously as they arrive; completeness can only be reasoned about by imposing some order on the events, such as the order in which they occurred.

Bounded data streams: the data has a defined start and a defined end. A bounded stream can be ingested completely before any computation starts, so ordered ingestion is not required; bounded streams are the domain of batch processing.

 

Flink processes unbounded streams through precise control of state and time, and processes bounded streams with internal algorithms and data structures specialized for fixed-size data sets.

Flink is a distributed system and requires compute resources at runtime. It can be deployed on resource managers such as Hadoop YARN, Apache Mesos, or Kubernetes, or run separately as a standalone cluster.

Stateful Flink applications are optimized for local state access. Task state is always kept in memory or, if the state size exceeds the available memory, in disk data structures that support efficient access. Tasks therefore perform all computations against local (usually in-memory) state, which yields very low processing latency. Flink periodically and asynchronously checkpoints the local state to durable storage to guarantee state consistency in case of a failure.
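The mechanics described above can be sketched outside of Flink: a task mutates purely local state, and a snapshot of that state is periodically copied to durable storage so that a restart can resume from the last checkpoint. The following is a minimal Python sketch, not Flink code; the names (`Task`, `checkpoint`, `recover`) are illustrative only.

```python
import copy

class Task:
    """Toy stateful task: all reads and writes hit local in-memory state."""
    def __init__(self):
        self.state = {}      # local state, accessed with no network hop
        self.durable = None  # stands in for persistent checkpoint storage

    def process(self, key, value):
        # all computation uses local state -> low processing latency
        self.state[key] = self.state.get(key, 0) + value

    def checkpoint(self):
        # in Flink this snapshot is taken asynchronously and incrementally;
        # here we simply copy the whole state to "durable storage"
        self.durable = copy.deepcopy(self.state)

    def recover(self):
        # after a failure, state is reloaded from the last checkpoint
        self.state = copy.deepcopy(self.durable) if self.durable else {}

task = Task()
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    task.process(k, v)
task.checkpoint()
task.process("a", 100)  # this update is lost if we fail before the next checkpoint
task.recover()          # simulate failure + recovery: state rolls back to the checkpoint
```

After recovery the task continues from the checkpointed state; in Flink, combining this with a replayable source is what restores exactly-once semantics.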

 

 

Applications

Flink is a framework for stateful computations over unbounded and bounded data streams. Flink provides multiple APIs at different levels of abstraction and offers dedicated libraries for common use cases. The following sections define the basic terms of data stream processing.

Streams

Streams are a fundamental aspect of stream processing. However, streams can have different characteristics that affect how they can and should be processed. Flink is a versatile processing framework that can handle any kind of stream: bounded or unbounded, real-time or recorded.

State

Every non-trivial streaming application is stateful; only applications that apply a transformation to each individual event in isolation do not require state. Any application that runs basic business logic needs to remember events or intermediate results so it can access them at a later point in time, for example when the next event is received or after a certain duration.

Application state is a first-class citizen in Flink. You can see this in all the features Flink provides for state handling.

Multiple state primitives: Flink provides state primitives for different data structures, such as atomic values, lists, or maps. Developers can choose the state primitive that best matches a function's access pattern.

Pluggable state backends: Application state is managed by, and checkpointed through, a pluggable state backend. Flink ships with state backends that store state in memory or in RocksDB, an efficient embedded on-disk data store; custom state backends can be plugged in as well.

Exactly-once state consistency: Flink's checkpointing and recovery algorithms guarantee the consistency of application state in case of a failure. Failures are therefore handled transparently and do not affect the correctness of the application.

Very large state: Flink can maintain application state of several terabytes thanks to its asynchronous and incremental checkpointing algorithm.

Scalable applications: Flink supports rescaling stateful applications by redistributing their state across more or fewer workers.
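The different shapes of keyed state listed above (atomic value, list, map) can be modeled with a small Python sketch. This is not the Flink state API; `KeyedState` and its method names are illustrative, showing only that each key owns its own independent state slots.

```python
class KeyedState:
    """Toy model of per-key state primitives (not the Flink API):
    each key gets its own value-, list-, and map-shaped state slot."""
    def __init__(self):
        self._value = {}  # like an atomic value per key (e.g. a counter)
        self._list = {}   # like an appendable list per key (e.g. an event buffer)
        self._map = {}    # like a nested map per key (e.g. attributes)

    def update_value(self, key, v):
        self._value[key] = v

    def append(self, key, v):
        self._list.setdefault(key, []).append(v)

    def put(self, key, k, v):
        self._map.setdefault(key, {})[k] = v

state = KeyedState()
state.update_value("user-1", 42)
state.append("user-1", "click")
state.append("user-1", "buy")
state.put("user-1", "last_page", "/checkout")
```

Choosing the primitive that matches the access pattern matters because the backend can then store and checkpoint it efficiently (e.g., appending to a list without rewriting the whole value).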

 

Time

Time is another important ingredient of streaming applications. Most event streams have inherent time semantics because each event is produced at a specific point in time. Moreover, many common stream computations are based on time, such as window aggregations, sessionization, pattern detection, and time-based joins. An important aspect of stream processing is how an application measures time, i.e., the distinction between event time and processing time.

Flink provides a rich set of time-related features.

Event-time mode: Applications that process streams with event-time semantics compute results based on the timestamps carried by the events. Event-time processing therefore yields accurate and consistent results regardless of whether recorded or real-time events are processed.

Watermark support: Flink uses watermarks to reason about time in event-time applications. Watermarks are also a flexible mechanism for trading off the latency of results against their completeness.

Late data handling: When a stream is processed in event-time mode with watermarks, it can happen that a computation completes before all relevant events have arrived. Such events are called late events. Flink offers several options for dealing with late events, such as rerouting them via side outputs and updating previously completed results.

Processing-time mode: In addition to event-time mode, Flink supports processing-time semantics, which trigger computations based on the wall-clock time of the processing machine. Processing-time mode can suit applications with strict low-latency requirements that can tolerate approximate results.
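The interplay of event time, watermarks, and late events described above can be illustrated with a toy model (plain Python, not the Flink watermark API; the function name and bound are illustrative). The watermark trails the largest event timestamp seen by a fixed out-of-orderness bound, and any event at or behind the watermark is treated as late.

```python
def assign_watermarks(events, max_out_of_orderness):
    """Toy watermark generator: watermark = max event timestamp seen
    minus an out-of-orderness bound. Events at or behind the watermark
    are classified as late."""
    watermark = float("-inf")
    on_time, late = [], []
    for ts, payload in events:
        if ts <= watermark:
            late.append(payload)   # in Flink: e.g. route to a side output
        else:
            on_time.append(payload)
            watermark = max(watermark, ts - max_out_of_orderness)
    return on_time, late, watermark

# events arrive out of order: (event_timestamp, payload)
events = [(1, "a"), (5, "b"), (3, "c"), (2, "d")]
on_time, late, wm = assign_watermarks(events, max_out_of_orderness=2)
```

A larger `max_out_of_orderness` accepts more stragglers as on-time but delays results; a smaller bound emits results sooner but classifies more events as late. That is exactly the latency/completeness trade-off watermarks expose.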

Layered API

Flink provides three layers of APIs. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases.

 

1. ProcessFunctions

ProcessFunctions are the most expressive function interfaces that Flink offers. Flink provides ProcessFunctions to process individual events from one or two input streams, or events grouped into a window. ProcessFunctions give fine-grained control over time and state. A ProcessFunction can arbitrarily modify its state and register timers that trigger callback functions in the future. Hence, ProcessFunctions can implement complex per-event business logic, as required by many stateful, event-driven applications.
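The combination of per-event logic, state, and timers can be sketched with a toy session-timeout detector. This is plain Python mimicking the shape of a ProcessFunction, not the Flink API; `TimeoutDetector` and its methods are illustrative names.

```python
class TimeoutDetector:
    """Toy per-key event processor with timers: each event (re)registers a
    timer for its key; if no further event arrives before the timer fires,
    a timeout record is emitted (the 'callback')."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.timers = {}   # key -> timer deadline (per-key state)
        self.output = []

    def process_element(self, key, ts):
        # each event pushes the key's timer further into the future
        self.timers[key] = ts + self.timeout

    def advance_time(self, now):
        # fire every timer whose deadline has passed
        for key, deadline in list(self.timers.items()):
            if deadline <= now:
                self.output.append((key, "timeout"))
                del self.timers[key]

det = TimeoutDetector(timeout=10)
det.process_element("session-1", ts=0)
det.process_element("session-2", ts=5)
det.process_element("session-1", ts=8)  # resets session-1's deadline to 18
det.advance_time(16)                    # only session-2 (deadline 15) fires
```

In a real ProcessFunction, the runtime delivers the timer callback and the keyed state is managed by the state backend; the per-event control flow, however, follows this pattern.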

2. DataStream API

The DataStream API provides primitives for many common stream processing operations, such as windowing, record-at-a-time transformations, and enriching events by querying an external data store. The DataStream API is available for Java and Scala and offers predefined functions such as map(), reduce(), and aggregate(). Functions can be defined by extending the predefined interfaces or as Java or Scala lambda expressions.
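The semantics of a keyed, record-at-a-time reduction can be sketched in plain Python (this is not the DataStream API; `key_by_reduce` is an illustrative stand-in for the keyBy/reduce pattern): for every incoming record, the new element is folded into the running aggregate of its key and the updated value is emitted.

```python
def key_by_reduce(stream, key_fn, reduce_fn):
    """Toy model of keyBy(...).reduce(...): per key, fold each new element
    into the running aggregate and emit the updated value downstream."""
    acc = {}
    out = []
    for element in stream:
        k = key_fn(element)
        acc[k] = element if k not in acc else reduce_fn(acc[k], element)
        out.append((k, acc[k]))  # streaming output: one update per input
    return out

# word count over a "stream" of (word, 1) records
records = [("to", 1), ("be", 1), ("or", 1), ("to", 1), ("be", 1)]
result = key_by_reduce(records,
                       key_fn=lambda r: r[0],
                       reduce_fn=lambda a, b: (a[0], a[1] + b[1]))
```

Note the streaming character of the output: every input record produces an updated aggregate for its key, rather than a single final table at the end.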

3. SQL & Table API

Flink features two relational APIs, the Table API and SQL. Both are unified APIs for batch and stream processing: queries are executed with the same semantics and produce the same results on unbounded real-time streams and on bounded historical data. The Table API and SQL use Apache Calcite to parse, validate, and optimize queries. They can be seamlessly integrated with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.
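The claim that the same query produces the same result on bounded and unbounded inputs can be demonstrated with a toy count-per-key "query" (plain Python, not Flink SQL; the function names are illustrative). The batch version groups the complete input at once, the streaming version maintains running state per record, and once the stream is drained both plans agree.

```python
from collections import Counter

def batch_count(rows):
    """Batch semantics: see the whole bounded input, then group and count."""
    return dict(Counter(rows))

def stream_count(rows):
    """Streaming semantics: fold each arriving row into running per-key
    state; the final state equals the batch result."""
    counts = {}
    for r in rows:
        counts[r] = counts.get(r, 0) + 1
    return counts

rows = ["a", "b", "a", "c", "a"]
same = batch_count(rows) == stream_count(rows)
```

This equivalence is what allows a relational query to be written once and executed against either a historical data set or a live stream.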

4. Libraries

Flink ships with several libraries for common data processing use cases. The libraries are typically embedded in an API and not fully self-contained; they therefore benefit from all features of the API and integrate with other libraries.

Complex Event Processing (CEP): Pattern detection is a very common use case for event stream processing. Flink's CEP library provides an API to specify patterns of events, for example as regular expressions or state machines. The CEP library is integrated with Flink's DataStream API, so patterns are evaluated on DataStreams. Applications of the CEP library include network intrusion detection, business process monitoring, and fraud detection.

DataSet API: The DataSet API is Flink's core API for batch processing applications. Its basic operators include map, reduce, (outer) join, co-group, and iterate. All operators are backed by algorithms and data structures that operate on serialized data in memory and spill to disk if the data size exceeds the memory budget. The data processing algorithms of the DataSet API are inspired by traditional database algorithms, such as hybrid hash-join and external merge-sort.

Gelly: Gelly is a library for scalable graph processing and analysis. Gelly is implemented on top of, and integrated with, the DataSet API, so it benefits from its scalable and robust operators. Gelly features built-in algorithms such as label propagation, triangle enumeration, and PageRank, and also provides a Graph API that eases the implementation of custom graph algorithms.
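The pattern-detection idea behind the CEP library can be modeled as a small state machine over an event stream. The following Python sketch is purely illustrative (it is not the Flink CEP API): the machine advances one state per matching event and reports the position at which a full pattern match completes, e.g. three consecutive failed logins.

```python
def detect(events, pattern):
    """Toy linear-pattern detector: advance one state per matching event;
    on a mismatch, fall back to state 1 if the event restarts the pattern,
    else to state 0. Returns the indices where the full pattern completed."""
    matches, state = [], 0
    for i, e in enumerate(events):
        if e == pattern[state]:
            state += 1
        elif e == pattern[0]:
            state = 1
        else:
            state = 0
        if state == len(pattern):
            matches.append(i)   # in CEP: emit a match / raise an alert
            state = 0
    return matches

events = ["ok", "fail", "fail", "fail", "ok", "fail"]
alerts = detect(events, pattern=["fail", "fail", "fail"])
```

Real CEP patterns support richer constructs (optional and repeated parts, time constraints, per-key evaluation), but the core is the same: a state machine driven forward by the event stream.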

Operations and Maintenance

1. Stable 24/7 operation

In a distributed system, component failures are common. To guarantee that a service runs stably around the clock, a stream processor such as Flink needs a failure recovery mechanism: it must be able not only to restart after a failure, but also to persist the internal state of each component so that, after recovery, the service can continue to operate correctly, as if the failure had never happened.

Flink keeps applications running continuously and consistently through several mechanisms:

Consistent checkpoints: Flink's failure recovery mechanism is based on consistent checkpoints of the distributed application's state. When a failure occurs, the application is restarted and its state is reloaded from the latest successfully completed checkpoint. Combined with replayable data sources, this mechanism guarantees exactly-once state consistency.

Efficient checkpointing: If an application maintains terabytes of state, checkpointing that state can be very expensive. To keep the impact of checkpointing on an application's latency SLAs (service-level agreements) small, Flink takes checkpoints asynchronously and incrementally.

End-to-end exactly-once output: For certain storage systems, Flink supports transactional sinks that guarantee exactly-once output even in the event of a failure.

Integration with cluster managers: Flink is tightly integrated with cluster management services such as Hadoop YARN, Mesos, and Kubernetes. When a process fails, a new process is automatically started to take over its work.

Built-in high availability: Flink ships with a high-availability mode that eliminates single points of failure. It is based on Apache ZooKeeper, a reliable service for distributed coordination.
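The mechanisms above combine into exactly-once recovery: a checkpoint stores the operator state together with the source offset, and after a crash both are restored and the replayable source is re-read from that offset. The following toy Python sketch (not Flink code; `run`, `fail_after`, and the checkpoint interval are illustrative) shows that a run with a crash produces the same result as a failure-free run.

```python
def run(source, fail_after=None):
    """Toy exactly-once recovery: checkpoints store (state, source offset);
    after a crash, both are restored and the replayable source is replayed
    from that offset, so each record affects the state exactly once."""
    state, offset = 0, 0
    checkpoint = (0, 0)
    processed = 0
    while offset < len(source):
        if fail_after is not None and processed == fail_after:
            fail_after = None             # crash once, then...
            state, offset = checkpoint    # ...restore the last checkpoint
            continue
        state += source[offset]           # process the next record
        offset += 1
        processed += 1
        if offset % 2 == 0:
            checkpoint = (state, offset)  # periodic consistent snapshot

    return state

source = [1, 2, 3, 4, 5]
total = run(source, fail_after=3)  # crash after 3 records, then recover
```

Records processed after the last checkpoint are replayed from the source after recovery, which is why the source must be replayable (e.g., a log such as Kafka) for the exactly-once guarantee to hold.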

2. Convenient upgrade, migration, suspension, and resumption of applications

Streaming applications that power business-critical services need to be maintained: bugs must be fixed, and functionality improved or extended. However, upgrading a stateful streaming application is not trivial, because we cannot simply stop the application and restart an improved version without losing its current state.

Flink's savepoints are a unique and powerful feature that solves the problem of preserving application state across upgrades, along with several related problems. A savepoint is a consistent snapshot of an application's state and is therefore very similar to a checkpoint; in contrast to checkpoints, however, savepoints must be triggered manually and are not automatically removed when the application is stopped. A savepoint can be used to start a state-compatible application and initialize its state. Savepoints enable the following features:

Application evolution: Savepoints are typically used when upgrading an application version. A fixed or improved version can be restarted from a savepoint taken by the previous version. The application can also be started from an earlier savepoint to repair incorrect results produced by a defective version.

Cluster migration: Using savepoints, applications can be freely migrated and deployed across different clusters.

Flink version upgrades: Using savepoints, an application can be safely and conveniently migrated to a newer Flink version.

Application rescaling: Savepoints are typically used when increasing or decreasing the parallelism of an application.

A/B tests and what-if scenarios: By starting two or more versions of an application from the same savepoint, their performance and results can be tested and compared.

Pause and resume: An application can be paused by taking a savepoint and then stopping it; at any later point in time, it can be resumed from that savepoint.

Archiving: Savepoints can be archived, allowing the state of an application to be reset to the point in time at which a given savepoint was taken.

3. Monitoring and controlling applications

Like any other service, continuously running streaming applications need to be supervised and integrated into an organization's operations infrastructure, such as monitoring and logging services. Monitoring helps to anticipate problems and react to them ahead of time; logging provides the records needed to track, investigate, and analyze the root causes of failures. Finally, easy-to-use interfaces for controlling running applications are another important strength of Flink.

Flink integrates well with many common logging and monitoring services and provides a REST API to control applications and query application information. Specifically:

Web UI: Flink provides a web UI to inspect, monitor, and debug running applications. It can also be used to submit jobs or cancel their execution.

Logging integration: Flink implements the popular slf4j logging interface and integrates with the logging frameworks log4j and logback.

Metrics: Flink provides a sophisticated metrics system to collect and report system and user-defined metrics. Metrics can be exported to several reporters, including JMX, Ganglia, Graphite, Prometheus, StatsD, Datadog, and Slf4j.

REST API: Flink exposes a REST API for operations such as submitting a new application, taking a savepoint of a running application, or canceling an application. The REST API also exposes metadata and collected metrics of running or completed applications.


Origin blog.csdn.net/henku449141932/article/details/109983749