Flink Execution Engine: The Road to Stream-Batch Unification

Introduction: This article, shared by Apache Flink Committer Ma Guowei, introduces how Flink unifies stream and batch processing as a big data computing engine.

The content includes:

1. Background
2. Stream-batch integrated layered architecture
3. Stream-batch integrated DataStream
4. Stream-batch integrated DAG Scheduler
5. Stream-batch integrated Shuffle architecture
6. Stream-batch integrated fault tolerance strategy
7. Future prospects

1. Background


With the continuous development of the Internet and the mobile Internet, every industry has accumulated massive amounts of business data. To improve user experience and product competitiveness, companies increasingly process this big data in real time: real-time dashboards for social media, real-time recommendations in e-commerce, real-time traffic forecasting for city brains, and real-time anti-fraud in the financial industry. The success of these products shows that real-time big data processing has become an unstoppable trend.


Riding this real-time trend, Flink has become the de facto standard in the real-time computing industry. Not only Alibaba, but leading companies in many fields at home and abroad use Flink as their technology base for real-time computing: ByteDance, Tencent, and Huawei in China, and Netflix, Uber, and others abroad.

Real-time business is just a starting point. One of Flink's goals is to give users a unified real-time and offline experience. In fact, many users need more than real-time statistics: to validate the effect of an operation or product strategy, they also need to compare against historical data (yesterday's, or even the same period last year's). From the user's point of view, maintaining separate stream and batch solutions has several pain points:

  • High labor cost. Since stream and batch are two separate systems, the same logic has to be developed twice, often by two teams.
  • Redundant data links. In many scenarios the stream and batch computations are essentially the same, but because there are two systems, the same logic still has to run twice, wasting a certain amount of resources.
  • Inconsistent data calibers. This is the most serious problem users encounter. Two systems, two sets of operators, and two sets of UDFs are bound to produce discrepancies of varying degrees. These discrepancies cause great trouble for the business side and cannot be solved simply by throwing manpower or resources at them.


On Double Eleven 2020, when the real-time compute peak reached a record high of 4 billion, the Flink team and the DT team jointly launched a data warehouse architecture based on Flink's full-link stream-batch unification, which solved a series of problems of the Lambda architecture: stream and batch jobs use the same SQL, improving R&D efficiency by 3 to 4 times; a single engine guarantees that data calibers are naturally consistent; and stream and batch jobs run in the same cluster, greatly improving resource efficiency.

The success of Flink's stream-batch unification is inseparable from the healthy, vigorous development of the Flink open source community. The Apache Software Foundation's 2020 annual report shows that Flink ranks at the top in three key indicators of community prosperity: first in user mailing list activity, second in developer commits, and second in GitHub visitor traffic. These rankings are across all Apache Software Foundation projects, not just those in the big data field.


2020 was also the second year of contributing Blink back to the community. Over these two years, we have gradually contributed the experience Blink accumulated within Alibaba Group back to the community, making Flink a truly unified stream-batch platform. With this article I hope to share what Flink has done in the execution engine for stream-batch unification over the past two years, and to help both old users and new friends of Flink better understand the "past and present" of Flink's unified stream-batch architecture.

2. Stream-batch integrated layered architecture


In general, Flink's core engine is mainly divided into the following three layers:

  • SDK layer. Flink has two main types of SDK. The first is the Relational SDK, namely SQL/Table; the second is the Physical SDK, namely DataStream. Both types are stream-batch unified: whether with SQL or DataStream, users develop their business logic once and can use it in both streaming and batch scenarios;
  • Execution engine layer. The execution engine provides a unified DAG to describe the data processing pipeline (the logical plan). Whether for a stream job or a batch job, the user's business logic is transformed into this DAG before execution. The Unified DAG Scheduler then transforms the logical DAG into tasks executed in a distributed environment. Data is transferred between tasks through Shuffle; we use the Pluggable Unified Shuffle architecture, which supports both stream and batch shuffle modes;
  • State storage layer. The state storage layer is responsible for storing the execution state of the operators. For stream jobs there are the open source RocksDBStateBackend and MemoryStateBackend, plus the commercial GeminiStateBackend; for batch jobs we introduced the BatchStateBackend in the community version.

This article mainly shares the following aspects:

  1. Stream-batch integrated DataStream: how the unified DataStream solves the challenges currently facing the Flink SDKs;
  2. Stream-batch integrated DAG Scheduler: how the unified Pipeline Region mechanism fully taps the performance advantages of the streaming engine, and how dynamically adjusting the execution plan improves the engine's ease of use and the system's resource utilization;
  3. Stream-batch integrated Shuffle architecture: how a unified Shuffle architecture meets the customization needs of different shuffle strategies while avoiding repeated development of common functionality;
  4. Stream-batch integrated fault tolerance strategy: how a unified fault tolerance strategy both meets fault tolerance needs in batch scenarios and improves the fault tolerance behavior in streaming scenarios.

3. Stream-batch integrated DataStream

SDK analysis and challenges

[Figure: Flink's three SDKs]

As shown in the figure above, Flink currently provides three main types of SDK:

  1. Table/SQL is a high-level, relational SDK mainly used in data analysis scenarios. It supports both bounded and unbounded input. Because Table/SQL is declarative, the system can optimize on the user's behalf: for example, based on the schema the user provides, it can push down filter predicates and deserialize binary data on demand. Table/SQL currently supports both Batch and Streaming execution modes. [1]
  2. DataStream is a Physical SDK. Although the Relational SDK is powerful, it has some limitations: it does not support operations on State and Timers, and because the optimizer evolves between versions, the same SQL may produce incompatible physical execution plans across two versions. The DataStream SDK supports low-level operations on State and Timers, and because DataStream is an imperative SDK, users have good "control" over the physical execution plan, with no incompatibility introduced by version upgrades. DataStream still has a large user base in the community; for example, there are still about 500 unclosed DataStream issues. Although applications written with DataStream can accept both bounded and unbounded input, before Flink 1.12 DataStream only supported the Streaming execution mode.
  3. DataSet is a Physical SDK that only supports bounded input. Some operators are optimized for the bounded characteristic, but it does not support features such as EventTime and State. Although DataSet was the earliest SDK Flink provided, with the continued growth of real-time and data analysis scenarios, DataSet's influence in the community is gradually declining relative to DataStream and SQL.

At present, Table/SQL is relatively mature in supporting unified stream-batch scenarios, but the Physical SDKs still face challenges, mainly in two aspects:

  1. With the existing Physical SDKs, it is impossible to write a stream-batch unified application that can be used in real production. For example, a user who writes a program to process real-time data in Kafka would quite naturally want to use the same program to process historical data stored on OSS/S3/HDFS. But at present, neither DataSet nor DataStream can satisfy this "simple" demand. You may find this strange: DataStream supports both bounded and unbounded input, so why is there still a problem? In fact, "the devil is in the details"; I will elaborate on this in the Unified DataStream section.
  2. The cost of learning and understanding is relatively high. As Flink continues to grow, more and more new users join the community, but they have to learn two Physical SDKs, so compared with other engines the cost of getting started is relatively high. The two SDKs also differ semantically: for example, DataStream has Watermarks and EventTime while DataSet does not, so understanding both mechanisms is no small hurdle. And because the two SDKs are incompatible, a new user who picks the wrong one faces a large switching cost.

Unified Physical SDK


To solve the challenges facing the Physical SDKs, we chose the Unified DataStream SDK as Flink's unified Physical SDK. This part mainly answers two questions:

  1. Why choose DataStream as the Unified Physical SDK?
  2. What capabilities does Unified DataStream provide beyond the "old" DataStream, so that users can write a stream-batch unified application usable in real production?

Why not Unified DataSet

To reduce the cost of learning and understanding, the most natural and simple solution is to choose one of DataStream and DataSet as Flink's only Physical SDK. So why did we choose DataStream instead of DataSet? There are two main reasons:

  1. User benefits. As analyzed above, DataSet's influence in the community is gradually declining. Choosing DataSet as the unified Physical SDK would invalidate users' large prior "investment" in DataStream, while choosing DataStream lets that existing investment pay extra dividends;
  2. Development cost. DataSet is too old and lacks support for many basic concepts of a modern real-time computing engine, such as EventTime, Watermark, State, and unbounded sources. A deeper reason is that existing DataSet operator implementations cannot be reused in streaming scenarios (Join, for example), whereas DataStream operators largely can. This raises the question: how do we reuse DataStream operators in both the streaming and batch scenarios?

Unified DataStream

Many users with some knowledge of Flink may ask: DataStream supports both bounded and unbounded input, so why do we say you cannot use DataStream to write a stream-batch unified application usable in real production? Simply put, DataStream was originally designed mainly for the unbounded scenario, so in the bounded scenario there is still some distance between DataStream and a traditional batch engine in terms of efficiency, usability, and ease of use. Specifically, this manifests in the following two aspects:

- Efficiency

Let me start with an example. Below is a performance comparison of DataStream and DataSet running a WordCount job of the same scale. As the example shows, DataSet's performance is nearly 5 times that of DataStream.

[Figure: WordCount performance comparison, DataStream vs. DataSet]

Obviously, for DataStream to support both streaming and batch scenarios in production, its efficiency in the bounded scenario must be greatly improved. So why is DataStream less efficient than DataSet?

As mentioned earlier, DataStream was originally designed mainly for the unbounded scenario, whose main characteristic is disorder: no DataStream operator may assume any ordering of the records it processes. Many operators therefore use a K/V store to cache this out-of-order data, then fetch it from the K/V store for processing and output when appropriate. In general, operator access to K/V storage involves a lot of serialization and deserialization and also triggers random disk I/O. In DataSet, by contrast, the data is assumed to be bounded, so the job can be optimized to avoid random disk I/O and to reduce serialization and deserialization. This is the main reason a WordCount written with DataSet is 5 times faster than one written with DataStream.

Knowing the cause, is the answer simply to rewrite all DataStream operators? In theory there is no problem, but DataStream has a large number of operators to rewrite, and some are quite complex, such as the family of Window-related operators. Rewriting them all would clearly be an enormous engineering effort. So instead, through a single-key BatchStateBackend we almost entirely avoided rewriting the operators, and at the same time obtained very good results.
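The contrast can be illustrated with a toy grouped count (plain Python, not Flink code; all names are illustrative): the unbounded style keeps per-key state in a store with simulated serialization on every access, while the bounded style sorts the input first and then needs state for only one key at a time.

```python
import pickle
from itertools import groupby

def streaming_style_count(records):
    """Unbounded style: per-key state store, (de)serialized on every access."""
    store = {}  # stands in for a K/V state backend such as RocksDB
    for key in records:
        # every access pays serialization cost, like a real state backend
        count = pickle.loads(store[key]) if key in store else 0
        store[key] = pickle.dumps(count + 1)
    return {k: pickle.loads(v) for k, v in store.items()}

def batch_style_count(records):
    """Bounded style: sort once, then hold state for a single key at a time."""
    out = {}
    for key, group in groupby(sorted(records)):
        out[key] = sum(1 for _ in group)  # one scalar of state, no K/V store
    return out

words = ["a", "b", "a", "c", "b", "a"]
assert streaming_style_count(words) == batch_style_count(words) == {"a": 3, "b": 2, "c": 1}
```

Both functions produce the same result; the difference is purely in how much (de)serialization and state-store traffic the bounded variant avoids.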

- Consistency

Those with some understanding of Flink will know that an application written with DataStream originally ran in the Streaming execution mode, in which end-to-end exactly-once semantics is maintained through checkpoints. Specifically, a job's Sink commits data to the external system only after all operators in the whole graph (including the Sink itself) have completed their respective snapshots. This is a typical two-phase commit (2PC) protocol that relies on Flink's checkpoint mechanism.

In the bounded scenario, although the Streaming mode can also be used, it may cause problems for users:

  1. Large resource consumption: the Streaming mode needs to obtain all required resources at the same time, and in some cases users may not have that many resources;
  2. High fault tolerance cost: in the bounded scenario, some operators may not support snapshots for efficiency reasons, so once an error occurs, the entire job may need to be re-executed.

Therefore, in the bounded scenario, users want their applications to adopt the Batch execution mode, which naturally solves both of the above problems. Supporting the Batch execution mode in the bounded scenario is relatively simple, but it introduces a very thorny problem: with the existing Sink API, end-to-end exactly-once semantics cannot be guaranteed. There is no checkpoint in the bounded scenario, yet the original Sink relies on checkpoints to guarantee end-to-end exactly-once. At the same time, we do not want developers to write two different Sink implementations for the two modes, because that would not leverage the existing connections between Flink and other ecosystems.

In fact, a transactional Sink mainly has to answer the following 4 questions:

  1. What to commit?
  2. How to commit?
  3. Where to commit?
  4. When to commit?

Flink should let Sink developers specify What to commit and How to commit, while the system chooses Where and When to commit according to the execution mode, so as to guarantee end-to-end exactly-once. In the end we proposed a new Unified Sink API, which lets developers write one Sink and run it in both the Streaming and Batch execution modes. What is presented here is only the main idea; how to guarantee end-to-end consistency in the bounded scenario, and how to connect with external ecosystems such as Hive and Iceberg, in fact still pose real challenges.
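The division of labor can be sketched in toy Python (illustrative only, not the actual Unified Sink API): the sink developer supplies What and How to commit, while the framework decides When, per checkpoint in Streaming mode and once at end of input in Batch mode.

```python
class FileSink:
    """Developer side: defines What to commit and How to commit."""
    def __init__(self):
        self.pending = []      # committables not yet published
        self.committed = []    # stands in for the external system

    def write(self, record):
        self.pending.append(record)          # What: staged records

    def commit(self):
        self.committed.extend(self.pending)  # How: atomically publish
        self.pending = []

def run(sink, records, mode, checkpoint_every=2):
    """Framework side: decides When to commit, depending on execution mode."""
    for i, record in enumerate(records, 1):
        sink.write(record)
        if mode == "streaming" and i % checkpoint_every == 0:
            sink.commit()    # Streaming: commit on each checkpoint
    if mode == "batch":
        sink.commit()        # Batch: a single commit at end of input

s = FileSink(); run(s, [1, 2, 3, 4], "streaming")
b = FileSink(); run(b, [1, 2, 3, 4], "batch")
assert s.committed == b.committed == [1, 2, 3, 4]  # same end result
```

The same `FileSink` runs unmodified in both modes; only the framework-side commit timing differs.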

4. Stream-batch integrated DAG Scheduler

What problem does the Unified DAG Scheduler solve

Flink originally had two scheduling modes:

  1. The streaming scheduling mode. In this mode, the scheduler applies for all resources a job needs and then schedules all the job's tasks at the same time; the tasks communicate in a pipelined way. Batch jobs can also run this way, with greatly improved performance, but for long-running batch jobs this mode still has problems: when the scale is relatively large, it consumes many resources simultaneously, which some users may not have; and the cost of fault tolerance is relatively high, e.g. once an error occurs, the entire job must be rerun.
  2. The batch scheduling mode. This mode resembles traditional batch engines: each task applies for resources independently, and tasks communicate through Batch Shuffle. The advantage is that the cost of fault tolerance is relatively low. The shortcoming is that data between tasks is exchanged entirely via disk, causing a lot of disk I/O.

In general, these two scheduling modes basically meet the needs of stream-batch unified scenarios, but there is still much room for improvement, specifically in three aspects:

1. Inconsistent architecture and high maintenance cost. The essence of scheduling is resource allocation, in other words solving the problem of when to deploy which tasks where. The original two scheduling modes differ in the timing and granularity of resource allocation, which ultimately prevents a fully unified scheduling architecture and requires developers to maintain two sets of logic. For example, in the streaming mode, the granularity of resource allocation is all tasks of the entire physical execution plan; in the batch mode, it is a single task. When the scheduler receives a resource, it must follow two different processing paths depending on the job type;
2. Performance. The traditional batch scheduling mode has a low fault tolerance cost but introduces heavy disk I/O, so its performance is not optimal, and it cannot exploit the advantages of the Flink streaming engine. In scenarios with relatively sufficient resources, a "streaming" scheduling mode can in fact be adopted to run batch jobs, avoiding the extra disk I/O and improving job execution efficiency. At night in particular, streaming jobs release some resources, which makes it possible to run batch jobs in a "streaming" way.
3. Adaptivity. At present, the physical execution plans of both scheduling modes are static. Statically generated physical execution plans suffer from high manual tuning cost and low resource utilization.

Unified scheduling based on Pipeline Region


To fully exploit the advantages of the streaming engine while avoiding the shortcomings of scheduling the whole graph simultaneously, we introduced the concept of the Pipeline Region. The Unified DAG Scheduler allows tasks within a DAG to communicate either in a pipelined way or in a blocking way, and a set of tasks connected by pipelined data exchanges is called a Pipeline Region. Whether a job is a stream job or a batch job, it applies for resources and schedules tasks at the granularity of Pipeline Regions. A careful reader will notice that the original two modes are in fact special cases of Pipeline Region scheduling.
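The region split can be sketched as connected components over pipelined edges (a toy Python illustration; task and edge names are made up, and Flink's actual implementation lives inside the scheduler):

```python
def pipeline_regions(tasks, edges):
    """Group tasks into Pipeline Regions: the connected components formed by
    keeping only pipelined edges (blocking edges separate regions)."""
    parent = {t: t for t in tasks}

    def find(t):                      # union-find with path halving
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b, kind in edges:
        if kind == "pipelined":       # blocking edges are region boundaries
            parent[find(a)] = find(b)

    regions = {}
    for t in tasks:
        regions.setdefault(find(t), set()).add(t)
    return sorted(sorted(r) for r in regions.values())

tasks = ["src", "map", "agg", "sink"]
edges = [("src", "map", "pipelined"),   # src and map form one region
         ("map", "agg", "blocking"),    # blocking edge: region boundary
         ("agg", "sink", "pipelined")]  # agg and sink form another region
assert pipeline_regions(tasks, edges) == [["agg", "sink"], ["map", "src"]]
```

With only pipelined edges the whole DAG is one region (the old streaming mode); with only blocking edges every task is its own region (the old batch mode), which is why both are special cases.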


Even when resources are sufficient for the "streaming" scheduling mode, which tasks can actually be scheduled in a "streaming" way?

Some may still worry that the fault tolerance cost of the "streaming" scheduling mode is relatively high, because in that mode, if one task fails, all tasks connected to it also fail and must run again.

In Flink, there are two connection patterns between tasks [2]: one is the All-to-All pattern, in which an upstream task is connected to every downstream task; the other is the PointWise pattern, in which an upstream task is connected to only some downstream tasks.

If all tasks of a job are connected All-to-All, then once the "streaming" scheduling mode is adopted, the entire physical topology must be scheduled at the same time, and there is indeed a problem of high failover cost [3]. However, in real batch job topologies not all tasks are connected by All-to-All edges; a large number of tasks in batch jobs are connected by PointWise edges. Scheduling the connected graph of PointWise-connected tasks in a "streaming" way can improve the job's execution efficiency while keeping its fault tolerance cost down. As shown in the following figure, in the full 10 TB TPC-DS test, enabling the pipelined connection mode on all PointWise edges brings a performance improvement of more than 20%.

[Figure: TPC-DS 10 TB benchmark results]

The above is just one of the four Pipeline Region partitioning strategies provided by the Scheduler [4]. In fact, the Planner can customize, according to the actual execution scenario, which tasks use the pipelined transmission mode and which use the batch transmission mode.

Adaptive scheduling

The essence of scheduling is the decision-making process of allocating resources to the physical execution plan. The Pipeline Region solves the problem of scheduling once the physical execution plan is determined: stream jobs and batch jobs can be scheduled uniformly at the granularity of Pipeline Regions. But for batch jobs, statically generating the physical execution plan still has some problems [5]:

  1. High configuration cost. For batch jobs, although in theory the concurrency of each stage of the physical execution plan can be inferred from statistics, in practice, because of numerous UDFs or missing statistics, static decisions can be seriously inaccurate. To guarantee the SLA of business jobs, during big promotions business engineers must manually tune the concurrency of high-priority batch jobs based on traffic estimates, and because business changes rapidly, this process has to be repeated whenever the business logic changes. The whole tuning process is manual and the labor cost is high; even so, misjudgments can still occur and the user's SLA can be missed;
  2. Low resource utilization. Because manually configuring concurrency is costly, it cannot be done for every job. For low- and medium-priority jobs, business engineers fall back on default concurrency values, which in most cases are too large and waste resources. And although high-priority jobs can be configured manually, because configuration is cumbersome, the big-promotion settings are often left in place after the promotion ends even though traffic has dropped, which also wastes a lot of resources;
  3. Poor stability. Resource waste ultimately leads to over-application of resources. At present, most batch jobs run mixed into stream job clusters; specifically, the requested resources are non-guaranteed resources. Once resources become tight or machine hotspots appear, these non-guaranteed resources are the first to be reclaimed.


To solve these problems of statically generated physical execution plans, we introduced an adaptive scheduling feature for batch jobs [6]. Compared with the original static physical execution plan, this feature can greatly improve users' resource utilization. The Adaptive Scheduler can dynamically decide the concurrency of a JobVertex according to the execution results of its upstream JobVertices. In the future, we could also dynamically decide which operators to use downstream based on the data the upstream JobVertex produces.
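The core decision can be sketched as a simple rule (toy Python; the thresholds and names are illustrative assumptions, not Flink's actual heuristics): derive a vertex's parallelism from the data volume its upstream actually produced, rather than from a static default.

```python
import math

def decide_parallelism(upstream_bytes, bytes_per_task=128 * 1024 * 1024,
                       min_parallelism=1, max_parallelism=256):
    """Choose a JobVertex's parallelism from the observed size of its
    upstream output, clamped to a configured range."""
    wanted = math.ceil(upstream_bytes / bytes_per_task)
    return max(min_parallelism, min(max_parallelism, wanted))

assert decide_parallelism(0) == 1                # tiny input: a single task
assert decide_parallelism(1 * 1024 ** 3) == 8    # 1 GiB at 128 MiB per task
assert decide_parallelism(10 ** 15) == 256       # huge input: capped at max
```

Because the decision is taken only after the upstream vertex finishes, the same job automatically uses few tasks on quiet days and many tasks during a big promotion, with no manual retuning.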

5. Stream-batch integrated Shuffle architecture

Flink is a stream-batch unified platform, so the engine must provide both stream and batch shuffles for the different execution modes. Although Streaming Shuffle and Batch Shuffle differ in their specific strategies, they both essentially re-partition data, so different shuffles share certain commonalities. Our goal is therefore a unified Shuffle architecture that can satisfy the strategy customization of different shuffles while avoiding repeated development of the common requirements.

In general, the Shuffle architecture can be divided into the four layers shown in the following figure. Stream and batch shuffle requirements differ somewhat at each layer, but they also have many commonalities. I give a brief analysis below.

[Figure: the four layers of the Shuffle architecture]

Differences between stream and batch Shuffle

As everyone knows, batch jobs and stream jobs have different requirements for shuffle, embodied in the following three aspects:

1. The life cycle of shuffle data. A stream job's shuffle data has basically the same life cycle as its tasks, while a batch job's shuffle data is decoupled from the task life cycle;
2. The storage medium of shuffle data. Because a stream job's shuffle data has a relatively short life cycle, it can be stored in memory; because a batch job's shuffle data has an uncertain life cycle, it needs to be stored on disk;
3. The shuffle deployment mode [7]. Deploying the shuffle service together with the compute nodes is an advantage for stream jobs, because it avoids unnecessary network overhead and thus reduces latency. For batch jobs, however, this deployment mode has certain problems in resource utilization, performance, and stability. [8]

Similarities between stream and batch Shuffle

The shuffles of batch and stream jobs have differences but also commonalities, mainly reflected in:

1. Meta management of the data. The so-called Shuffle Meta is the mapping from the logical partitioning of the data to its physical location. In both the stream and batch scenarios, under normal circumstances the physical location of the data being read or written is looked up in the Meta; under abnormal circumstances, to reduce the cost of fault tolerance, Shuffle Meta data is usually persisted;
2. Data transmission. Logically, the shuffle of stream and batch jobs alike re-partitions data (re-partition/re-distribution), and in a distributed system this re-partitioning involves data transmission across threads, processes, and machines.

Stream-batch unified Shuffle architecture


The Unified Shuffle architecture abstracts three components [9]: Shuffle Master, Shuffle Reader, and Shuffle Writer. Flink re-partitions data between operators by interacting with these three components, through which different shuffle plugins implement their specific strategies:

  1. The Shuffle Master handles resource application and release. In other words, the plugin tells the framework How to request/release resources, while Flink decides When to call it;
  2. The upstream operator uses the Shuffle Writer to write data to the shuffle service: a Streaming Shuffle writes data to memory, while an External/Remote Batch Shuffle can write data to external storage;
  3. The downstream operator reads shuffle data through the Shuffle Reader.
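The pluggable split can be sketched as follows (toy Python; the class and method names are illustrative, and the real interfaces in Flink differ in detail): the plugin implements how partitions are registered, written, and read, while the framework decides when each hook is called.

```python
class InMemoryShuffle:
    """A toy shuffle plugin. The Master role registers/releases partitions,
    the Writer role fills them, the Reader role consumes them. A batch
    plugin could back `parts` with files or remote storage instead."""
    def __init__(self):
        self.parts = {}

    # Shuffle Master: How to request/release resources (framework picks When)
    def register_partition(self, pid):
        self.parts[pid] = []

    def release_partition(self, pid):
        del self.parts[pid]

    # Shuffle Writer: the upstream operator writes records into a partition
    def writer(self, pid):
        return self.parts[pid].append

    # Shuffle Reader: the downstream operator reads the partition back
    def reader(self, pid):
        return iter(self.parts[pid])

shuffle = InMemoryShuffle()
shuffle.register_partition("p0")      # framework: called at deployment time
write = shuffle.writer("p0")
for record in (1, 2, 3):
    write(record)                     # upstream task
assert list(shuffle.reader("p0")) == [1, 2, 3]   # downstream task
shuffle.release_partition("p0")       # framework: called when the job ends
assert "p0" not in shuffle.parts
```

Swapping `InMemoryShuffle` for a disk- or remote-backed class changes the storage strategy without touching the framework-side call sequence, which is the point of the pluggable design.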

At the same time, the architecture provides support for the common parts of stream and batch shuffle (Meta management, data transmission, and service deployment [10]), so that complex components need not be developed repeatedly. Efficient and stable data transmission is one of the most complex subsystems in a distributed system; problems such as upstream/downstream backpressure, data compression, and memory zero-copy must all be solved during transmission. In the new architecture, this needs to be developed only once and can then be used in both streaming and batch scenarios, greatly reducing development and maintenance costs.

6. Stream-batch integrated fault tolerance strategy

Flink's original fault tolerance strategy is based on checkpoints. Specifically, whether a single task or the JobMaster fails, Flink restarts the entire job from the most recent checkpoint. Although this strategy has some room for optimization, it is basically acceptable for streaming scenarios in general. Currently, checkpoints are not enabled in Flink's Batch execution mode [11], which means that once any error occurs, the entire job must be executed from the beginning.

Although the original strategy can in theory guarantee that correct results are eventually produced, most customers clearly cannot accept the price this fault tolerance strategy exacts. To solve these problems, we made corresponding improvements to the fault tolerance of both Tasks and the JM.

Pipeline Region Failover

Although there is no periodic checkpoint in the Batch execution mode, Flink allows tasks to communicate through Blocking Shuffle in that mode. When a task that reads from a Blocking Shuffle fails, all the data it needs is still stored in the Blocking Shuffle, so only that task and the downstream tasks connected to it through pipelined shuffles need to be restarted, instead of the entire job.

In general, the Pipeline Region Failover strategy works just like the Scheduler during normal scheduling: it splits the DAG into several Pipeline Regions, and whenever a task fails over, only the Region containing that task is restarted.
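The restart decision can be sketched like this (toy Python, illustrative names): given a region assignment and a failed task, only that task's region is restarted, because its inputs can be re-read from the persisted Blocking Shuffle.

```python
def tasks_to_restart(regions, failed_task):
    """Restart only the Pipeline Region containing the failed task. Its
    inputs can be re-read from the persisted Blocking Shuffle, so upstream
    regions need not rerun (simplified: blocking data assumed intact)."""
    for region in regions:
        if failed_task in region:
            return set(region)
    raise KeyError(failed_task)

# two regions separated by a blocking edge between "map" and "agg"
regions = [{"src", "map"}, {"agg", "sink"}]
assert tasks_to_restart(regions, "agg") == {"agg", "sink"}   # not the whole job
assert tasks_to_restart(regions, "src") == {"src", "map"}
```

The sketch deliberately ignores the harder case the article raises later, namely what to do when the upstream blocking data itself is lost and the producing region must be rerun as well.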


JM Failover

The JM is the job's control center and holds all of the job's execution state, which Flink uses to schedule and deploy tasks. Once the JM fails, all of this state is lost, and without it the new JM cannot continue scheduling the original job even if none of the worker nodes has failed. For example, because task completion information is lost, when a task finishes, the new JM cannot determine whether the conditions for scheduling downstream tasks are met, i.e. whether all of their input data has been produced.

From the above analysis, the key to JM failover is how to let the new JM "recover its memory". In VVR [12], we restore the key state of the JM through an Operation Log mechanism.
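The article does not detail the VVR implementation, so the following is only a minimal sketch of the operation-log idea (all class and task names are hypothetical): the JM appends "task finished" events to a persistent log, and a new JM replays that log to recover which tasks have completed, so it can again decide whether a downstream task's inputs are all ready.

```python
# Illustrative sketch (not the VVR implementation) of recovering a
# JobMaster's state by replaying an operation log after failover.

class JobMaster:
    def __init__(self, inputs_of, op_log=None):
        self.inputs_of = inputs_of   # task -> list of upstream tasks
        self.op_log = op_log if op_log is not None else []
        self.finished = set()
        for task in self.op_log:     # replay the log: "recover memory"
            self.finished.add(task)

    def on_task_finished(self, task):
        self.op_log.append(task)     # persist the event before updating state
        self.finished.add(task)

    def can_schedule(self, task):
        # Downstream task is schedulable only if all inputs are produced.
        return all(up in self.finished for up in self.inputs_of[task])

# Hypothetical topology: C consumes the blocking outputs of A and B.
inputs_of = {"A": [], "B": [], "C": ["A", "B"]}
jm = JobMaster(inputs_of)
jm.on_task_finished("A")
jm.on_task_finished("B")

# The JM crashes; a new JM starts from the persisted operation log.
new_jm = JobMaster(inputs_of, op_log=list(jm.op_log))
print(new_jm.can_schedule("C"))  # True: completion info was recovered
```

Without the log replay, a fresh JM would see no finished tasks and could not safely schedule `C`, which is exactly the failure mode described above.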


Attentive readers may have noticed that although these two improvements were motivated by batch scenarios, they are actually just as effective for streaming jobs. The above is only a brief introduction to the ideas behind the two fault tolerance strategies; in fact, there is still much worth thinking about. For example, what should we do if upstream Blocking Shuffle data is lost? Which key states in the JM need to be restored?

7. Future prospects

In order to provide a faster and more stable user experience than today, we have started developing the next-generation stream processing architecture. Flink has been recognized by more and more users in stream-batch unified scenarios, but we also know that the industry still has many advanced traditional big data systems worth learning from. Finally, we hope that interested friends will join us in building a big data computing engine with an excellent user experience.

Notes:

[1] Streaming and Batch here are two execution modes that are independent of semantics. Streaming execution mode can be simply understood as using Pipeline Shuffle between tasks; Batch execution mode can be simply understood as using Blocking Shuffle between tasks.
[2]  https://ci.apache.org/projects/flink/flink-docs-release-1.13/api/java/org/apache/flink/runtime/jobgraph/DistributionPattern.html
[3] We are developing an Adaptive Shuffle mode; using this mode can avoid the high fault tolerance cost caused by the "pure" pipeline approach.
[4]  https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/graph/GlobalDataExchangeMode.html
[5] For streaming jobs, the static physical execution plan has problems similar to those of batch jobs. We provide an AutoPilot system to dynamically modify the physical execution plan. Since AutoPilot is an independent service and does not belong to the execution engine, we will not go into details here.
[6] Due to scheduling constraints, this feature currently exists only in VVR, our commercial version of the execution engine.
[7] In some cases, the batch Shuffle Service is also deployed together with computing nodes. For example, in Flink Session mode, although co-deploying the Shuffle Service with computation has a certain stability cost, for some users this deployment mode is a trade-off between cost and stability. So to a certain extent, streaming and batch Shuffle also share commonality in deployment, though they are not exactly the same.
[8] The problems with co-locating a batch job's computation and its Shuffle data on one node: (1) low resource utilization and high cost: if no further computing tasks are scheduled on the node, its computing resources are wasted, while releasing those resources early would save user cost; (2) suboptimal performance: since each node only holds part of the shuffle data, a Reduce task needs to pull its data from n nodes, causing a large number of random IO reads that greatly degrade job performance; (3) poor stability: once a node goes down, all the shuffle data it holds is lost, triggering job recomputation, which is relatively expensive. (Tasks contain user code, so the probability of such a node going down is higher than that of a dedicated Shuffle node in a storage-compute-separated architecture.)
[9] Due to historical reasons, when reading the Flink code you will see ResultPartition/InputGate rather than Reader and Writer. Reader and Writer are used here to lower the barrier for readers new to Flink.
[10] Why is the deployment also considered common? You can refer to [7].
[11] Although in theory batch jobs can support checkpoints, in batch scenarios the cost of enabling the native streaming checkpoint is relatively high. Of course, this does not completely rule out the possibility that more suitable approaches may be found in the future.
[12] VVR is the execution engine of the commercial Flink product. Due to scheduling constraints, this feature has not yet been contributed back to the Flink community.

Activity recommendation:

You can experience the real-time computing Flink version of Alibaba Cloud's enterprise-level product based on Apache Flink for only 99 yuan! Click the link below to learn about the event details: https://www.aliyun.com/product/bigdata/sc?utm_content=g_1000250506

Origin blog.csdn.net/weixin_43970890/article/details/115231969