Practice of migrating Hive SQL to Flink SQL at Kuaishou

Abstract: This article is compiled from a talk given by Kuaishou data architecture engineer Zhang Mang and Alibaba Cloud engineer Liu Dalong in the production practice session of Flink Forward Asia 2022. It is organized into four parts:

  1. Flink Stream-Batch Integrated Engine

  2. Flink Batch production practice

  3. Core optimizations explained

  4. Future plans


1. Flink Stream-Batch Integrated Engine

1.1 Lambda architecture

First, let me introduce why we chose Flink as our unified stream-batch engine. The figure above shows the Lambda architecture, the most widely used architecture in production; most of you are probably familiar with it, if not using it already. Its advantages are obvious:

  • Flexible. The real-time pipeline and the offline pipeline are completely independent; each is developed according to its own needs and they do not affect each other.

  • Easy to implement. Both the real-time and offline pipelines have mature solutions.

The shortcomings are equally obvious. The real-time and offline pipelines cannot share storage, so there is serious resource redundancy.

There are also two computing engines: Spark is generally used for offline computing and Flink for real-time computing, so two sets of code have to be learned and maintained, which is costly.

Moreover, the real-time and offline pipelines are usually developed and maintained by two different teams, so it is hard to keep implementation details and business semantics aligned, and mismatched results are common. For these reasons, business teams are eager for stream-batch unification.

1.2 Engine Unification

We divide stream-batch unification into two aspects: unifying the engine and unifying the storage. This talk focuses on unifying the engine.

With a unified engine, users only need to learn one engine and most of the code they develop can be reused. This greatly reduces development and maintenance costs, and data quality is easier to guarantee because the computation logic is identical. In addition, switching the engine of an offline job at Kuaishou is very convenient, so the rollout pace and quality of the unified engine are easy to control.

So which engine should be the unified stream-batch engine? After comparing the mainstream big data engines, we chose Flink.

Flink is the benchmark in stream computing, stream-batch unification was considered in its architecture from the start, and it has an active community. After multiple release iterations, its batch capability has become reasonably usable, and we already had some batch business running on it in production.

2. Flink Batch production practice

Next, let's focus on Flink Batch in production. We currently run 3000+ Flink Batch jobs stably online, mostly Batch SQL jobs migrated smoothly from Hive.

We provide users with several entry points. Among them, the Batch SQL entry, which uses the Hive dialect for traditional offline production development, is the focus of this talk.

The Flink Batch entry on the scheduling platform mainly lets users who are already familiar with Flink develop batch jobs directly with the Flink dialect or API, with full offline scheduling support. The other entries are business systems that business teams build on top of our platform for their own needs.

To run Flink Batch with the Hive dialect in a production environment, the following issues need to be addressed.

  • Define the rollout process and criteria. Suitable jobs must first be screened out, and then metrics such as data quality, timeliness, and resource usage must be verified before the jobs go live.
  • Solve syntax compatibility with Hive SQL, and integrate with the various offline production systems, such as the permission center and the metadata center.
  • Ensure stable operation in production. The offline environment is much more complex than the real-time one, and problems that never appear in real-time scenarios will come up.
  • Close the performance gap with the original offline engine, for example with the optimization that removes the Sort operator in dynamic partition writes, which Liu Dalong will introduce later.

Once all of these are resolved, large-scale adoption can begin.

Next, a brief introduction to Kuaishou's offline production system. At the application layer there are various development platforms and business systems. At the service layer, Kuaishou uses HiveServer as the unified entrance for Batch SQL, and the Hive dialect is used throughout.

BeaconServer handles SQL rewriting, engine routing strategies, HBO (history-based optimization), and so on. The engine layer below can be switched freely, so to bring Flink into offline production we only need to adapt HiveServer and add Flink routing rules to BeaconServer.

We currently connect HiveServer to Flink through SQL Client, and may add SQL Gateway support in the future.

Having solved how to integrate with the offline system, we then defined the rollout process.

  • Step 1: screen out Batch SQL jobs that meet the requirements. At the beginning, for example, we chose low-priority, simple data processing jobs.
  • Step 2: parse and validate the SQL with Flink to determine whether Flink supports it.
  • Step 3: rewrite the SQL that Flink can run, redirecting the target table to a table in a test database, and submit it as a shadow job.
  • Step 4: compare the results of the shadow job and the online job, as well as their resource usage.
  • Step 5: switch jobs that passed the first four steps to the Flink engine and keep monitoring their data quality.

The dual-run capability from step 3 can also be used after the switch: the original offline engine runs the shadow job while Flink runs the online job, and the results are compared to make sure nothing unexpected happens. This step is very important; it helps us discover cases we had not considered in time. Because the online environment is very complex, more observation is needed in the early stage.

This process is now automated; our effort is mainly spent on fixing the abnormal cases it finds.

Below are several key points of this process for your reference.

When we first used Flink to validate SQL, many commonly used syntax constructs appeared to be unsupported, which seemed odd. Analysis showed that Flink's HiveParser was not actually being used because we had enabled it incorrectly. After reading the relevant code, we found that two conditions must be met for HiveParser to take effect in Flink:

  • Use the Hive dialect.
  • The current catalog must be a HiveCatalog; otherwise Flink falls back to the default Flink parser.

In addition, the HiveModule must have the highest priority, so that for functions with the same name in Flink and Hive, the Hive implementation is used. A minimal sketch is shown below.
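The following sketch shows one way to satisfy these conditions in the Flink SQL Client; the catalog name, hive-conf-dir, and hive-version are placeholders and would differ per environment.

```sql
-- Hedged sketch: enabling HiveParser in the Flink SQL Client.
CREATE CATALOG hive_catalog WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/etc/hive/conf'      -- assumed path
);
USE CATALOG hive_catalog;                 -- condition 2: the current catalog is a HiveCatalog

-- Give HiveModule the highest priority so same-named functions resolve to Hive's implementation.
LOAD MODULE hive WITH ('hive-version' = '2.3.9');
USE MODULES hive, core;

SET 'table.sql-dialect' = 'hive';         -- condition 1: use the Hive dialect
```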

As shown in the figure above, after switching to the approach on the right, the SQL validation pass rate increased significantly. But quite a few batch syntax constructs were still unsupported, such as ADD JAR with a remote path and INSERT DIRECTORY.

For SQL rewriting, there are generally two situations (see the sketch after this list).

  • The user job contains no CREATE TABLE statement for the target table. We use CREATE TABLE LIKE to create the target table in the test database first, and then modify the original SQL to write to the test database.
  • The user job does contain a CREATE TABLE statement for the target table. We rewrite that statement to create the table in the test database, and then modify the original SQL to write to the test table.
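A hedged sketch of both rewrite cases, with hypothetical database and table names:

```sql
-- Case 1: the job has no CREATE TABLE for the target table.
-- Create the target table in the test database first, then redirect the write.
CREATE TABLE test_db.dws_user_stats LIKE prod_db.dws_user_stats;

INSERT OVERWRITE TABLE test_db.dws_user_stats PARTITION (dt = '20230101')
SELECT user_id, COUNT(1) AS order_cnt
FROM prod_db.dwd_orders
WHERE dt = '20230101'
GROUP BY user_id;

-- Case 2: the job already contains a CREATE TABLE for the target table.
-- Rewrite that DDL to point at the test database, then redirect the write the same way.
CREATE TABLE IF NOT EXISTS test_db.dws_user_stats (
  user_id   BIGINT,
  order_cnt BIGINT
) PARTITIONED BY (dt STRING);
```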

Shadow jobs are executed with a low-privilege account that can only write to the test database, so that if SQL rewriting fails, no data is written to the production database.

For quality verification, our strategy is as follows. First, using the job input information recorded by HiveServer, we check that the input data volume and partitions are consistent. Then, using the job statistics, we check that the amount of data written is consistent. Finally, we check that the written results themselves are consistent.

We compare the results by summing each column: if the sums of all columns match, we consider the data quality fine. Numeric columns are summed directly; non-numeric columns are hashed first and the hash codes are then summed.
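A hedged sketch of such a column-checksum query; the table and column names are hypothetical:

```sql
-- Numeric columns are summed directly; non-numeric columns are hashed first, then summed.
SELECT
  SUM(order_cnt)                        AS chk_order_cnt,   -- Number-type column
  SUM(CAST(hash(user_name) AS BIGINT))  AS chk_user_name    -- non-Number column
FROM test_db.dws_user_stats
WHERE dt = '20230101';
-- Run the same query against the online table and compare the two rows of checksums.
```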

Once the results match, we compare resource overhead, using YARN's accounting uniformly: the resources of each container multiplied by its running time, summed over all containers.

Our criteria for going live are: the data quality is correct, resource usage increases by no more than 10% compared to the original engine, and the execution time exceeds that of the original engine by no more than 20 minutes. Having covered the rollout process for Flink Batch jobs, let's look at what was needed to integrate with offline production.

The figure above lists some of the modifications we made. Originally, Flink's configuration consisted of three parts: Flink, Hadoop, and Hive, which made configuration management complicated and unclear.

Because we go through HiveServer, when HiveServer starts Flink it passes the Hive session configuration to Flink, which includes both the configuration users set manually and the Hadoop-related configuration. We therefore split Flink's configuration into two parts: Flink's own configuration, and the Hadoop and Hive configuration.

SQL Client enables word completion by default: type part of a word and press Tab to complete it. This is fine in interactive mode, but when SQL is passed in from a file, a Tab character in the SQL content can trigger completion and silently change the SQL, leading to "column not found" errors. So completion must be disabled when SQL is read from a file.

Job progress reporting is very important for the user experience. Without it, users cannot see progress information after submission the way they can with Hive or Spark, and HiveServer cannot tell whether the job is running normally or stuck forever. We therefore implemented progress reporting; if no progress is reported for a long time, HiveServer actively kills the job.

Finally, monitoring dashboards are essential: they help locate problems during analysis; without them you can only guess. Beyond that, integrating with offline production also required adapting to the platform products.

For example, publishing empty partitions and writing SUCCESS FILEs: Kuaishou's offline scheduling platform currently supports three dependency methods.

  • Task dependency. When the upstream task succeeds, the downstream task is launched.
  • Partition dependency. The platform detects whether the partition metadata has been created, and launches downstream tasks once it has.
  • SUCCESS FILE dependency. Whether the downstream task is launched depends on whether the file exists.

Flink decides which partitions to publish based on the file directories written by the sink. This is fine for dynamic partitions. But for a static-partition write job that happens to produce no data, Flink will not publish the partition, so the downstream job may never be launched. The same kind of problem occurs if the SUCCESS FILE is not written. A hedged sketch of the relevant sink options is shown below.
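As an illustration, Flink's Hive sink exposes partition-commit options that cover publishing to the metastore and writing a _SUCCESS file; the table name is hypothetical, and whether these options apply to a particular batch write path may depend on the Flink version.

```sql
-- Hive dialect: configure the partition-commit policy as table properties.
ALTER TABLE prod_db.dws_user_stats
SET TBLPROPERTIES ('sink.partition-commit.policy.kind' = 'metastore,success-file');
```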

As for statistics collection, Flink Batch originally did not collect statistics, so after a partition was published the metadata center showed 0 rows. Users seeing this assumed the job had failed and would rerun it; and if they had configured data quality checks, the checks would also fail without statistics.

Having covered integration, let's look at the problems we ran into after going live.

Offline production is generally T+1: the previous day's data is processed after midnight, so a large number of jobs are scheduled after midnight and offline resources become very tight. Baseline jobs launched at this time may not get resources, so to ensure baseline jobs finish on time, YARN kills containers of some low-priority jobs and hands their resources to the baseline jobs.

Flink fails a job when the total number of task failures within a time window reaches a threshold. Offline engines, by contrast, only fail the job after the same task has failed several times, and they do not count failures caused by the platform.

When Flink Batch went live, it ran into resource preemption: a job would fail after running for a while, trigger the scheduling platform's failure retry, and succeed only after several retries. Other jobs did not fail outright, but because tasks were killed and their data had to be recomputed, execution time was prolonged.

Simply raising the task failure threshold is not an option: if a failure is caused by business logic and the threshold is higher, the exception will not be detected in time and serious incidents can result.

So, following the practice of offline engines, we look at the concrete failure reason when a task fails: if it is a platform reason such as resource preemption or a machine going offline, the failure is not counted. This fixed the frequent failure retries of Flink jobs. If users feel the running time is too long, they should consider raising the job priority.

With resource preemption solved, slow nodes in the offline cluster were the next threat to stability. High CPU utilization and busy IO are very common in offline clusters, and a long tail of individual tasks drags out the execution time of the whole job.

The solution is straightforward: speculative execution, as in offline engines. When a task's execution time exceeds the average execution time of its peers by some margin, the scheduler launches a mirror task on another node, and whichever finishes first has its output used.

This feature was developed jointly by Kuaishou and the community. Note that splits in Flink are assigned dynamically, unlike the static split assignment of Hive and Spark, so speculative execution for sources is considerably more complex to implement, and abnormal cases such as resource preemption also have to be handled.

As aggregation jobs went live, we found the execution time of some simple aggregation jobs was very unstable: sometimes fast, sometimes abnormally slow. Careful analysis showed that Flink uses the TaskManager for shuffle by default, and a TaskManager cannot be released until its shuffle data has been fully consumed downstream. This creates two problems:

  • Wasted resources. An idle TaskManager cannot be released.
  • If resource preemption happens at this point, or the machine goes offline and the TaskManager is killed, the shuffle data is lost and has to be recomputed, which lengthens the job's execution time.

To solve this, the shuffle service needs to be decoupled from the TaskManager. There are two approaches.

  • A shuffle service based on the YARN NodeManager, as in Hive or Spark. Flink does not implement this yet, so we would have to build it ourselves.
  • A remote shuffle service. Flink has an open source implementation, and Kuaishou also has a self-developed remote shuffle service.

After evaluation, we chose Kuaishou's self-developed Remote Shuffle Service, because it supports push-based shuffle: the shuffle service merges the data of the same shuffle partition so a task can read all its shuffle data from one place. The community's remote shuffle service will also support this in the future.

In addition, Kuaishou's Remote Shuffle Service performs end-to-end data consistency verification, which is a strong guarantee of data quality.

As the migration volume grew, we faced a hard problem: the default parallelism setting is not optimal for most jobs.

In real-time computing, parallelism is set by the user. In offline computing, users do not set parallelism; the engine derives it automatically from the data volume. Setting parallelism by hand is not practical for us: the data volume changes every day, so the same parallelism cannot be reused day after day.

If parallelism had to be set manually, smooth migration of Hive/Spark jobs would be impossible. We worked with the community to solve this: the adaptive batch scheduler automatically estimates an appropriate parallelism from the data volume, so user jobs can be migrated without modification.

One exception is the parallelism of small-file merging, which the adaptive scheduler cannot yet estimate accurately; we worked around it with a temporary hack, and the community plans to extend the API for this special case.

We are now gradually ramping up aggregation batch jobs and have hit two complicated problems, which we are solving together with the community.

  • Hive UDAF support. Flink currently supports Hive UDAFs only in the Partial1 and Final modes and cannot yet support functions such as Rank.
  • Hash aggregation support. Jobs that use Hive UDAFs currently fall back to sort aggregation, and the performance gap with hash aggregation is significant.

To migrate aggregation jobs smoothly, hash aggregation and full Hive UDAF support are essential.

3. Core optimizations explained

Rolling out Flink Batch at Kuaishou ran into many problems, covering syntax compatibility, the Hive connector, stability, and more. Kuaishou and the community solved them together and successfully brought Flink Batch online. Next, I will walk through the optimization and improvement work the community did around availability, ease of use, and stability.

Flink follows standard ANSI SQL, while Hive SQL differs from ANSI SQL in many syntax details. To migrate Hive SQL to the Flink SQL engine, Kuaishou chose the Hive dialect, so that most jobs could be migrated without users modifying their SQL. Before Flink 1.16 the community had already done a lot of work on Hive dialect compatibility, but it still fell short of full Hive SQL compatibility: after Kuaishou selected a batch of jobs to migrate, analysis and validation surfaced many unsupported constructs.

With Kuaishou's input, the community prioritized supporting them. The figure above lists some of the important and commonly used constructs, such as CTAS, ADD JAR, USING JAR, macros, and TRANSFORM.

UDFs are heavily used in Hive SQL: users typically add a remote UDF JAR to the job first, then register and use the function. Flink did not support ADD JAR at the time, so many jobs could not be migrated. In addition, algorithm engineers prefer not to write Java UDFs; they usually write Python scripts and process data through TRANSFORM (sketched below). Completing the Hive dialect syntax removed the first blocker in the migration and ensured that existing Hive SQL can run on the Flink engine.
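A hedged example of the ADD JAR and TRANSFORM patterns mentioned above, in the Hive dialect; the JAR path, script name, and column names are hypothetical.

```sql
-- Register a UDF from a remote JAR the traditional Hive way.
ADD JAR hdfs:///udf/jars/log-udfs-1.0.jar;
CREATE TEMPORARY FUNCTION parse_device AS 'com.example.udf.ParseDevice';

-- Process rows with a Python script via Hive's TRANSFORM syntax.
ADD FILE hdfs:///scripts/parse_log.py;
SELECT TRANSFORM (line)
  USING 'python parse_log.py'
  AS (user_id STRING, action STRING)
FROM raw_logs;
```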

In Flink 1.16 the community did a lot of work to complete the Hive syntax. Measured with the qtest suite, overall compatibility now reaches about 95%, which basically guarantees that users' existing queries can be migrated to Flink. The two umbrella issues FLINK-25592 and FLINK-26360 track the Flink Batch related work. Because CTAS and USING JAR both involve changes to public APIs, they have corresponding FLIP design documents in the community, and I will introduce their design in detail next.

As shown in the figure above, let's start with FLIP-214, CREATE FUNCTION ... USING JAR. This feature touches the class loader of the SQL module, so it is worth explaining the design ideas to help everyone avoid class loader pitfalls.

Anyone who writes SQL knows that, with business logic being so varied, an engine's built-in functions often fall short and users have to write UDFs, especially with Java-based big data engines. The UDF is packaged into a JAR and uploaded to a remote HDFS path; to use it, you either add the JAR first or create the UDF directly from the JAR.

Given this scenario and Kuaishou's needs, the community added USING JAR support in 1.16. As highlighted in red on the slide, the syntax adds the USING JAR clause and allows specifying the JAR location, which can be remote or local. This syntax is currently supported only for Java and Scala functions.
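A hedged example of the FLIP-214 syntax; the class name and JAR locations are hypothetical.

```sql
-- Create a temporary function directly from a remote JAR.
CREATE TEMPORARY FUNCTION parse_device
  AS 'com.example.udf.ParseDevice'
  LANGUAGE JAVA
  USING JAR 'hdfs:///udf/jars/log-udfs-1.0.jar';

-- The JAR can also be a local file, and the function can be persisted in a catalog.
CREATE FUNCTION my_catalog.my_db.parse_device
  AS 'com.example.udf.ParseDevice'
  LANGUAGE JAVA
  USING JAR '/opt/udf/log-udfs-1.0.jar';
```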

Now let's look at how USING JAR works in detail. Step one is registering the UDF. During registration, we parse the UDF's DDL and first determine whether the function is temporary. If it is not, it is registered directly in the catalog with no extra work. If it is temporary, we then check whether the JAR path is a local file or a remote HDFS or OSS address.

For a local JAR, we check whether the JAR is valid; if so, its path is added to the ResourceManager and also to the MutableURLClassLoader.

A side note: to address the class loader issues around connectors and catalogs that frequently showed up in the Flink Table module, the community introduced a MutableURLClassLoader in the Table module in 1.16. Each TableEnvironment holds its own class loader to which JARs can be added dynamically, which solves the dynamic JAR loading problem.

The JAR is then registered with the FunctionCatalog for management. If the JAR is at a remote address there is an extra download step, handled by the ResourceManager, which downloads the JAR to a local temporary directory and at the same time loads it into the MutableURLClassLoader.

Step two is using the UDF. When a query uses the UDF, parsing and optimization first determine whether the function is temporary or persistent. If persistent, the JAR path is fetched from the catalog, the JAR is downloaded locally and loaded into the class loader, and only then is the query optimized and a JobGraph generated.

After the JobGraph is generated, step three is deploying the job to the cluster. The JARs are needed not only when optimizing the query but also when running on the cluster, otherwise the job would hit ClassNotFoundException at runtime. So how is this done?

We use Flink's BlobServer. When the job is submitted to the cluster, all local JARs maintained by the ResourceManager are first uploaded to the BlobServer on the Flink JobManager, the part marked by the yellow dotted line in the figure. When the job executes, the TaskManagers pull these JARs from the BlobServer.

Next comes another commonly used feature, CTAS (CREATE TABLE AS SELECT). This syntax is supported by all big data engines; compared with plain CREATE TABLE, the difference is the part marked in red on the slide.

With CTAS, the engine automatically infers the schema of the target table from the SELECT query and the catalog creates the table; it is equivalent to first creating the target table and then running an INSERT INTO ... SELECT. The biggest benefit is that for complex queries users no longer have to hand-write the target table's DDL, which saves a lot of work. This is very useful in production.
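A minimal CTAS sketch with hypothetical table names; the engine derives the target table's schema from the SELECT:

```sql
CREATE TABLE dws_user_order_cnt AS
SELECT user_id,
       COUNT(1) AS order_cnt
FROM dwd_orders
WHERE dt = '20230101'
GROUP BY user_id;
-- Equivalent to creating dws_user_order_cnt (user_id, order_cnt) first,
-- then running INSERT INTO dws_user_order_cnt SELECT ...
```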

Now for the overall implementation of CTAS. First, the user writes a CTAS query. During client-side compilation and optimization, we derive the target table's schema from the query and serialize the corresponding catalog, so that it can be deserialized on the JobManager to perform the table creation there. At the same time we generate a hook object that is responsible for calling the catalog on the JobManager to create the target table.

The second step is job execution. Before the job starts being scheduled, the hook object and catalog object are deserialized on the JobManager; the hook then calls the catalog to create the target table, and scheduling begins.

If the job ultimately succeeds, nothing more needs to be done. If it fails or is manually canceled, then as a matter of principle the hook calls the catalog to drop the created target table, so the external system is left with no side effects.

Since Flink is a unified stream-batch engine, CTAS can be used in both streaming and batch scenarios. In streaming scenarios, however, we generally do not delete the table when a job fails; instead we rely on the external system's update capability to ensure eventual consistency of the data.

We therefore introduced an atomicity option that lets the user decide whether atomicity should be guaranteed. In Flink 1.16 the community completed only the basic CTAS functionality, without atomicity; that will be completed in 1.17. For details, see the FLIP-218 design document.

During Kuaishou's Flink Batch practice, we found problems in many areas of the Hive connector, for example split computation speed, statistics collection, and small file merging. The figure above lists some of the more important capabilities in day-to-day use.

These optimizations enriched the Hive connector and made it much more usable in batch scenarios. Next, I will cover dynamic partition write optimization and small file merging in detail.

Writing to static partitions always requires the user to specify the value of the partition column; dynamic partitioning lets users omit the partition column's value when writing data.

For example, given such a partitioned table, the user can write data into it with a SQL statement like the one below.
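(The slide's example is not reproduced in the text; the following is a hedged equivalent with hypothetical table and column names.)

```sql
-- A partitioned table, partitioned by dt.
CREATE TABLE user_stats (
  user_id BIGINT,
  cnt     BIGINT
) PARTITIONED BY (dt STRING);

-- Dynamic partition write: no value is given for the partition column dt;
-- it is taken from the last column of the SELECT.
INSERT OVERWRITE TABLE user_stats
SELECT user_id, COUNT(1) AS cnt, dt
FROM dwd_events
GROUP BY user_id, dt;
```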

In this statement the user does not specify the value of the partition column, which makes it a typical dynamic partition write.

What plan does Flink generate for this? As the execution plan on the right shows, there are four nodes, and the gray Sort node is worth noting: when Flink writes to dynamic partitions, it first sorts the data by the dynamic partition columns and then writes the partitions one by one.

This has some benefits but also lengthens the job's execution time. So, for this situation and Kuaishou's business scenarios, we introduced an option that lets users disable the Sort node when writing to dynamic partitions, avoiding the extra sort and speeding up data delivery to downstream consumers.

Small files are another very common problem in production. When writing to a Hive table, job parallelism is usually set fairly high to keep write throughput up, which speeds up writing but also produces small files.

Small files increase pressure on the HDFS NameNode and its RPC layer and are unfriendly to downstream reads. Moreover, with dynamic partition writes a single subtask may write to many partitions at once, producing a large number of small files. To address this, we added adaptive small file merging to Hive writes in batch mode.

The figure above shows the topology of the Hive sink with small file merging in batch mode. There are four nodes: Writer, CompactorCoordinator, Rewriter, and PartitionCommitter; the core ones are CompactorCoordinator and Rewriter.

The CompactorCoordinator runs with a parallelism of one. After an upstream Writer finishes writing a file, it reports the file path to the CompactorCoordinator. Once the CompactorCoordinator has received all the upstream files, it determines which files are small and what the merged target file size should be, and decides which small files to merge into which large target files.

This information is passed to the Rewriter, which performs the actual merge, and finally the PartitionCommitter commits the partition information. Adaptive small file merging reduces the number of files and the pressure on HDFS, improves read efficiency for user jobs, and speeds up execution.

Next, the performance problems we hit with UDAFs. First, two concepts: in aggregation scenarios there are generally two strategies, sort-based aggregation (Sort-Agg) and hash-based aggregation (Hash-Agg).

Sort-Agg sorts the data globally by the GROUP BY key before aggregating. It then traverses the sorted data, accumulating while the key stays the same; when a new key appears, the previous group is complete and its result can be emitted downstream, after which aggregation starts on the group for the new key.

Hash-Agg builds a hash table in memory, keyed by the GROUP BY key, with the running aggregate of each group as the value; once all the data has been traversed, the final results are output. Hash-Agg is generally done in memory and is more efficient, while Sort-Agg needs an extra external sort, so its performance is comparatively poor.

Flink currently has two aggregate function interfaces, ImperativeAggregateFunction and DeclarativeAggregateFunction. The pros and cons of implementing UDAFs against each are listed on the left of the figure above.

Hive UDAFs can currently only use the Sort-Agg strategy, so overall performance is relatively poor. After investigation, we decided to re-implement some commonly used Hive UDAFs in Flink on top of the DeclarativeAggregateFunction interface; the hard part is keeping the behavior consistent with Hive. With the re-implementation, most queries can use Hash-Agg and overall reach the same performance as the built-in functions.

Next, another important feature: the adaptive batch scheduler. Anyone who has written Flink streaming jobs knows that parallelism must be set before a job goes live. For streaming jobs this is accepted as a given, but for batch jobs the situation is much more complicated.

First, there are a lot of batch jobs, often hundreds, thousands, or even tens of thousands; tuning parallelism case by case is time-consuming, laborious, and simply not feasible.

Second, the data volume may change every day and is hard to predict, so the same parallelism setting does not always fit the same job. There is then no guarantee that the job's running time stays within a stable baseline window, which has a significant impact on production.

Finally, for SQL jobs, apart from sources and sinks only a single global parallelism can be configured; fine-grained per-operator parallelism is not possible, which leads to wasted resources and extra overhead.

To solve these problems, the community introduced an adaptive batch scheduler for Flink, with which the framework automatically derives each node's parallelism from the amount of data that node needs to process.

This kind of configuration is general enough to apply to most jobs without per-job tuning, the automatically derived parallelism adapts to each day's data volume, and because the data each node actually needs to process is collected at runtime, fine-grained per-node parallelism is possible. The process is roughly as follows:

  • When all execution vertices of an upstream logical node have finished, we collect the amount of data they produced.
  • Once the amount of data a downstream logical node will consume is known, the parallelism derivation policy component computes an appropriate parallelism for that node.
  • After the logical node's parallelism is determined, its execution vertices are added to the execution topology and scheduling and deployment are attempted.

The difference from traditional Flink job execution is that the execution topology used to be built at submission time and was static, whereas for adaptive batch scheduling it is generated dynamically. Under a dynamic execution topology, a downstream task can consume multiple subpartitions, which decouples the execution of upstream nodes from the parallelism of downstream nodes.

Adaptive batch scheduling, combined with the parallelism derivation of the Hive source, solves the parallelism-setting problem. On Kuaishou's side the benefits show up in two ways (a configuration sketch follows the list):

  • Users no longer need to configure parallelism for each job, which makes Flink Batch easier to use; fine-grained parallelism is supported, avoiding wasted resources.
  • Operator parallelism adjusts automatically to the data volume, which keeps jobs stable in production, protects the output baseline, and lets jobs be migrated and launched smoothly.
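A hedged sketch of the configuration involved, using the Flink 1.16 option names; the values are illustrative, and some of these are cluster-level settings that would normally live in flink-conf.yaml rather than a SQL session.

```sql
-- Enable the adaptive batch scheduler and let it derive operator parallelism.
SET 'execution.runtime-mode' = 'batch';
SET 'jobmanager.scheduler' = 'AdaptiveBatch';
SET 'parallelism.default' = '-1';   -- operators left at -1 get their parallelism derived automatically

-- Bounds and target data volume per task for the derivation.
SET 'jobmanager.adaptive-batch-scheduler.min-parallelism' = '1';
SET 'jobmanager.adaptive-batch-scheduler.max-parallelism' = '1024';
SET 'jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task' = '1gb';
```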

Next, a fairly important feature for production stability that the community and Kuaishou collaborated on: speculative execution. In production, hotspot machines are generally unavoidable. Mixed-deployment clusters and intensive data backfills can drive a machine's load and IO very high, making the Flink tasks running on it extremely slow; occasional machine anomalies cause the same problem.

Such slow tasks drag out the whole job's execution time, so the job's output baseline cannot be guaranteed. Speculative execution is a widely accepted way to solve this, and the community introduced it in Flink 1.16.

With speculative execution enabled, when the framework finds that a subtask of a batch job is significantly slower than the others, it creates a new execution attempt for it, which we call a shadow task, on a normal machine; the original slow attempt is kept and continues to run.

Shadow tasks have the same inputs and outputs as their original tasks. Whichever attempt finishes first is acknowledged and its output can be consumed by downstream nodes; the other attempts are cancelled and their output is discarded.

The concrete flow is shown in the figure. When the SlowTaskDetector finds slow tasks, it notifies the speculative execution scheduler. The scheduler marks the machine running the slow task as a hotspot and adds it to the blocklist. Then, if the number of running attempts of the slow task has not yet reached the limit, the scheduler creates a new attempt and deploys it. As soon as any attempt finishes successfully, the scheduler cancels all other attempts of the corresponding execution vertex. A sketch of the related configuration is shown below.
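A hedged sketch of the related configuration, using the Flink 1.16 option names; the values are illustrative.

```sql
SET 'execution.batch.speculative.enabled' = 'true';
-- Maximum concurrent attempts (original + shadow) per execution vertex.
SET 'execution.batch.speculative.max-concurrent-executions' = '2';
-- How long a hotspot node stays blocked from receiving new deployments.
SET 'execution.batch.speculative.block-slow-node-duration' = '1 min';
-- A task is considered slow only after running at least this long ...
SET 'slow-task-detector.execution-time.baseline-lower-bound' = '1 min';
-- ... and when its execution time exceeds the baseline by this multiplier.
SET 'slow-task-detector.execution-time.baseline-multiplier' = '1.5';
```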

That was the generic framework-level flow; sources and sinks need special handling. For a source, we must ensure that the different attempts of the same source subtask always read the same data, otherwise correctness cannot be guaranteed. A few special cases have to be considered:

  • With the new FLIP-27 sources, splits are assigned dynamically, so we have to make sure the shadow task and the original slow task process the same splits.
  • The shadow task may process the splits the slow task has already received faster and catch up, at which point it requests more splits; the original slow task must then also process those same splits. The two race each other, and whichever finishes first has its data used.
  • Due to resource preemption, machine anomalies, and so on, the shadow task or the slow task may die. If only one dies it can be ignored; if both die, once the speculative execution scheduler detects this it must return the splits that were being processed and schedule new attempts to process them.

In short, a cache is added at the framework level to record, for each source subtask, the splits it has been assigned and the splits already processed by all of its attempts.

The sink side is much simpler since it only writes data: whichever of the shadow task and the slow task finishes first has its data committed, and the data of the other, now invalid, sink attempt is cleaned up to avoid duplication.

Speculative execution keeps batch task execution stable and output times predictable and controllable, which ensures the overall stability of Flink Batch in Kuaishou's production use and lays a good foundation for rolling batch out further.

That covers the core optimization and improvement work done jointly by the community and Kuaishou around batch, from the three angles of availability, ease of use, and production stability. This work enabled Flink Batch to launch at Kuaishou and run stably in production. As Kuaishou pushes batch further, there is still plenty of work ahead.

4. Future plans

As shown in the figure above, we will keep investing in Flink Batch. Metrics dashboards and a usable History Server need to be completed as soon as possible, to make problem diagnosis easier and let users solve simple issues on their own. Once the remaining issues in aggregation scenarios are resolved, we can migrate aggregation jobs at scale; once the join scenarios are sorted out, migration of complex join jobs will begin.

On the stream-batch unified storage side, once the engine capability is in place we will build a unified storage service that offers a single read/write API for both streaming and batch jobs, solving the cost problem of redundant storage.
