Flink: introduction, characteristics, and comparison with other big data frameworks

What is Flink

As data volumes grow rapidly, businesses generate large amounts of data across many scenarios, and effectively processing this continuously generated data has become a challenge for most companies. Apache Spark, the currently popular big data processing engine, has largely replaced MapReduce as the de facto standard for big data processing. For real-time data processing, however, Spark's Spark Streaming still leaves room for performance improvement: its stream computation is essentially batch (micro-batch) computation. Apache Flink, one of the technologies that has matured in the open source community in recent years, is a true real-time distributed processing framework that simultaneously delivers high throughput, low latency, and high performance.

Flink definition

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, and to perform computations at in-memory speed and at any scale.

Bounded and unbounded streams

Unbounded streams: the start of the stream is defined, but not its end. They generate data endlessly. Unbounded data must be processed continuously: each record must be handled as soon as it is ingested, because it is impossible to wait for all data to arrive before processing — the input is infinite and never ends. Processing unbounded data usually requires ingesting events in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about.
Bounded streams: both the start and the end of the stream are defined. A bounded stream can be processed after all of its data has been ingested, and all of its data can be sorted, so ordered ingestion is not required. Bounded stream processing is usually called batch processing.
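The difference can be illustrated with a minimal Python sketch (illustrative only, not Flink code): a bounded input can be fully materialized and sorted before computing a result, while an unbounded input must be consumed incrementally, emitting updated results as records arrive.

```python
import itertools

def bounded_sum(records):
    # Bounded: all data is available, so we can materialize and sort first.
    data = sorted(records)
    return sum(data)

def unbounded_running_sums(stream):
    # Unbounded: records never end, so each one is processed on arrival
    # and an up-to-date result is emitted immediately.
    total = 0
    for record in stream:
        total += record
        yield total

print(bounded_sum([3, 1, 2]))                        # 6

endless = itertools.count(1)                         # 1, 2, 3, ... forever
first_results = list(itertools.islice(unbounded_running_sums(endless), 3))
print(first_results)                                 # [1, 3, 6]
```

Note that the unbounded variant can never call `sorted()` or `sum()` over the whole input — there is no "whole input" — which is exactly why it must maintain a running result.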

Stateful computing architecture

A data stream is a continuous sequence of real events generated in chronological order. Computing directly on the data as it is generated and producing statistical results is difficult, because it places very high demands on the system: it must simultaneously achieve high performance, high throughput, and low latency.
The stateful stream computing architecture (as shown in the figure) was proposed to meet these enterprise needs. Based on real-time streaming data, it maintains the state of the entire computation (the state being the intermediate results produced during processing); each time new data enters the system, the computation proceeds from the intermediate state, eventually producing correct statistical results.
The biggest advantage of stateful computing is that there is no need to re-read the original data from external storage and recompute everything from scratch (a very expensive approach). At the same time, users no longer need to schedule and coordinate batch computation tools to pull statistics out of a data warehouse, which greatly reduces the system's dependence on other frameworks and cuts both the time and the hardware storage consumed by the computation.
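As a minimal illustration of this idea (plain Python, not Flink's state API): each incoming event updates a stored intermediate result, so the current answer never requires rescanning the original data.

```python
def stateful_count(state, event):
    """Update a running per-key count using only the new event and prior state."""
    key = event["user"]
    state[key] = state.get(key, 0) + 1
    return state

state = {}  # the "state": intermediate results kept across events
events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
for e in events:
    state = stateful_count(state, e)

print(state)  # {'a': 2, 'b': 1}
```

Each update costs O(1), regardless of how many events have already been processed — the alternative, recomputing counts from all raw events on every new arrival, grows linearly with history.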

Why use Flink

Stateful stream computing will gradually become the standard architecture for enterprise data platforms, and at present, from the community's perspective, only Apache Flink fully satisfies it. By implementing the Google Dataflow streaming computation model, Flink achieves a real-time streaming framework with high throughput, low latency, and high performance. At the same time, Flink supports highly fault-tolerant state management, preventing state from being lost to system failures during computation: it periodically persists state via its distributed snapshot mechanism, Checkpoints, so that correct results can still be computed even after a crash or other abnormal condition.

Application scenarios

In principle, Flink can be applied to all big data scenarios: financial transaction data, Internet order data, GPS positioning data, sensor signals, data generated by mobile terminals, communication signal data, as well as the familiar network traffic monitoring and server logs. What these data sources have in common is that data is generated in real time from different sources and then transmitted to downstream analysis systems. Typical real-time business scenarios include real-time intelligent recommendation, complex event processing, real-time fraud detection, real-time data warehousing and ETL, streaming data analysis, and real-time reporting.

Features and advantages

Flink has the following features:
1. Supports high throughput, low latency, and high performance simultaneously.
Flink is currently the only distributed stream processing framework in the open source community that combines high throughput, low latency, and high performance. By comparison, Apache Spark offers only high throughput and high performance, while the streaming framework Apache Storm supports only low latency and high performance and cannot meet high-throughput requirements.
2. Supports the concept of event time (Event Time).
Most current stream computing frameworks perform window computations using processing time (Process Time), i.e., the current system time of the host when an event reaches the framework. Flink also supports window computations based on Event Time semantics, i.e., the time at which the event was actually generated. This event-driven mechanism lets the streaming system compute accurate results even when events arrive out of order, preserving the original order in which events occurred and minimizing the influence of network transmission or hardware delays.
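The idea can be sketched in plain Python (an illustrative model, not Flink's watermark API): events carry their own timestamps and may arrive out of order; a window is only finalized once a watermark — here simply the largest timestamp seen minus an allowed lateness — has passed the window's end.

```python
# Each event carries its own event-time timestamp (seconds); arrival order differs
# from event-time order (ts=1 and ts=4 arrive late).
events = [
    {"ts": 3}, {"ts": 1}, {"ts": 7}, {"ts": 4}, {"ts": 12},
]

WINDOW = 5             # 5-second tumbling windows: [0,5), [5,10), ...
ALLOWED_LATENESS = 4   # how far behind the watermark trails the max timestamp

windows = {}           # window start -> event count
emitted = []
max_ts_seen = 0

for e in events:
    max_ts_seen = max(max_ts_seen, e["ts"])
    watermark = max_ts_seen - ALLOWED_LATENESS
    start = (e["ts"] // WINDOW) * WINDOW
    windows[start] = windows.get(start, 0) + 1
    # Finalize any window whose end is at or before the watermark:
    # no more (sufficiently punctual) events can belong to it.
    for s in sorted(list(windows)):
        if s + WINDOW <= watermark:
            emitted.append((s, windows.pop(s)))

# Flush remaining open windows at end of input.
for s in sorted(windows):
    emitted.append((s, windows.pop(s)))

print(emitted)  # [(0, 3), (5, 1), (10, 1)]
```

The late event with `ts=4` still lands in the correct window `[0, 5)` because that window was not finalized until the watermark passed 5 — this is the accuracy-under-disorder property that processing-time windows cannot provide.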
3. Supports stateful computation.
State here means saving an operator's intermediate results in memory or in a file system during stream processing; when the next event reaches the operator, the current result can be computed from the previously stored intermediate result, with no need to recompute from all of the raw data each time. This greatly improves system performance and reduces the resources consumed by the computation.
4. Supports highly flexible window (Window) operations.
In stream processing applications, data is continuous, so aggregate computations over a range of stream data must be performed through a window — for example, counting how many users clicked on a web page in the last minute. In that case, a window must be defined to collect the last minute of data and recompute within it. Flink provides windows based on Time, Count, Session, and Data-driven semantics. Windows can be customized with flexible trigger conditions to support complex streaming patterns, and users can define different window trigger mechanisms to meet different needs.
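A tumbling (fixed-size) time window, the simplest of these, can be sketched in a few lines of plain Python (illustrative only; Flink's actual window API differs):

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size (tumbling) time window."""
    counts = Counter()
    for ts, user in events:
        # Every timestamp maps to exactly one non-overlapping window.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# (timestamp in seconds, user id) -- e.g. page-click events
clicks = [(5, "u1"), (20, "u2"), (61, "u1"), (90, "u3"), (119, "u2")]
print(tumbling_window_counts(clicks, 60))  # {0: 2, 60: 3}
```

Sliding and session windows differ only in how timestamps map to windows: sliding windows overlap, and session windows close after a gap of inactivity rather than at a fixed boundary.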
5. Fault tolerance based on lightweight distributed snapshots (Checkpoints).
Flink can run distributed across thousands of nodes, breaking a large computation into small tasks that are distributed to nodes and processed in parallel. During execution, Flink can automatically detect inconsistencies caused by failures such as node crashes, network transmission problems, or computation services restarted for upgrades or fixes. In these cases, Checkpoints — based on distributed snapshot technology — persist the state of the execution; if a task stops abnormally, Flink can automatically restore it from the latest Checkpoint, guaranteeing Exactly-Once semantics during data processing.
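The recovery idea behind Checkpoints can be modeled in plain Python (a toy sketch, not Flink's implementation): periodically snapshot the state together with the input offset, and on failure restore the snapshot and replay from that offset, so every event is reflected in the state exactly once.

```python
import copy

def run_with_checkpoints(events, fail_at, checkpoint_every=2):
    """Count events with periodic state snapshots; recover once from a crash."""
    state = {"count": 0}
    offset = 0
    checkpoint = (copy.deepcopy(state), 0)  # (state snapshot, replay offset)
    crashed = False

    while offset < len(events):
        if offset == fail_at and not crashed:
            crashed = True
            # Crash: in-memory state is lost. Restore the last durable
            # checkpoint and replay from its offset, so no event is
            # counted twice and none is skipped (exactly-once effect).
            state, offset = copy.deepcopy(checkpoint[0]), checkpoint[1]
            continue
        state["count"] += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (copy.deepcopy(state), offset)  # persist snapshot

    return state["count"]

print(run_with_checkpoints(list(range(5)), fail_at=3))  # 5, despite the crash
```

Flink's real mechanism (asynchronous barrier snapshots) coordinates such snapshots consistently across many parallel operators, but the invariant is the same: state and input position are restored together.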
6. Independent memory management on the JVM.
Flink implements its own memory management mechanism to minimize the impact of JVM garbage collection on the system. In addition, Flink serializes all data objects into binary form and stores them in memory, which both reduces the data's storage footprint and makes more effective use of memory space, lowering the risk of GC-induced performance degradation and task failures. As a result, Flink tends to be more stable than other distributed processing frameworks and is not taken down by issues such as JVM GC pauses.
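The effect of binary serialization can be illustrated with Python's standard `struct` module (an analogy only; Flink's memory management is implemented on the JVM): fixed-size binary rows are far more compact than per-record objects and put no pressure on the garbage collector.

```python
import struct

# Pack (user_id, amount) records into fixed-size binary rows instead of
# keeping them as boxed objects: compact, and no per-object GC overhead.
RECORD = struct.Struct("<qd")   # 8-byte int + 8-byte float = 16 bytes/record

records = [(1, 9.5), (2, 3.25), (3, 7.0)]
buf = bytearray(RECORD.size * len(records))
for i, (uid, amount) in enumerate(records):
    RECORD.pack_into(buf, i * RECORD.size, uid, amount)

# Deserialize a single record directly from the binary buffer by offset.
uid, amount = RECORD.unpack_from(buf, 1 * RECORD.size)
print(uid, amount)              # 2 3.25
print(len(buf))                 # 48 bytes for 3 records
```

Because every record sits at a known offset in one contiguous buffer, individual fields can be read without deserializing whole objects — the same principle Flink exploits for sorting and joining on binary data.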
7. Save Points.
For streaming applications that run 24/7, data is continuously ingested, and stopping the application for any period of time can cause data loss or inaccurate results — for example, when upgrading the cluster version or shutting down for maintenance. With Save Points, Flink stores a snapshot of task execution on a storage medium; when the task restarts, its original computation state can be restored directly from the saved Save Point, so the task continues running from the state it had before the shutdown. Save Points let users manage and operate real-time streaming applications much more effectively.

Comparison of streaming computing frameworks

The development of computing engines has gone through several stages: from the first generation, MapReduce, to the second generation, Tez, based on directed acyclic graphs, to the third generation, Spark, based on in-memory computing, and on to the fourth generation, Flink. The frameworks compare as follows:

Model: Storm and Flink truly process data one record at a time, while Trident (a framework layered on Storm) and Spark Streaming actually process data in micro-batches, handling one small batch at a time.
API: Storm and Trident expose relatively low-level APIs, so development is comparatively complex; Spark Streaming and Flink both provide encapsulated higher-order functions that can be used directly, which is more convenient.
Processing guarantees: Storm guarantees that data is processed at least once but cannot guarantee exactly once, which can lead to duplicate processing and therefore errors in counting use cases; Trident can guarantee through transactions that data is processed exactly once, as can Spark Streaming and Flink.
Fault-tolerance mechanism: Storm and Trident achieve fault tolerance through an ACK mechanism, while Spark Streaming and Flink rely on a Checkpoint mechanism.
State management: Storm does not implement state management; Spark Streaming implements DStream-based state management, while Trident and Flink implement operator-based state management.
Latency: the delay in processing data. Because Storm and Flink process each record as it arrives, their processing latency is very low; Trident and Spark Streaming are micro-batch systems, so their latency is comparatively high.
Throughput: Storm's throughput is not actually low, but it is low compared with the other frameworks; Trident's is medium; Spark Streaming and Flink both have relatively high throughput.


Origin blog.csdn.net/zhangxm_qz/article/details/108112717