Flink OLAP query optimization and implementation practice at ByteDance

This article is based on a talk given by ByteDance infrastructure engineer He Runkang in the core technology session of Flink Forward Asia 2022. Flink OLAP is an important application in the data warehouse system; it supports complex analytical queries and is widely used in data analysis and business decision-making scenarios. The talk covers five topics: an overview of ByteDance Flink OLAP, query optimization, cluster operations, maintenance and stability work, benefits, and future plans.

 

1. Introduction to ByteDance Flink OLAP

Business adoption

Since its launch, ByteDance Flink OLAP has been adopted by more than 12 core business lines, including User Growth, Feishu, E-commerce, and Xingfuli. The clusters have grown to more than 16,000 cores and serve over 500,000 queries per day. A single cluster sustains 200 QPS of complex queries at peak, while P99 query latency stays within 5 s, meeting the performance requirements of these businesses.

Architecture

The overall architecture of Flink OLAP consists of two parts: the Flink SQL Gateway and the Flink Session Cluster.

A user submits a query through the client. The query first goes through SQL parsing and optimization in the Gateway, which generates the job's execution plan and submits it to the JobManager of the Flink Session Cluster. The JobManager's Dispatcher component creates a corresponding JobMaster, which deploys the tasks to the appropriate TaskManagers according to specific scheduling rules; the execution result is finally returned to the client.

Flink OLAP serves as the AP engine of ByteHTAP, an internally developed high-performance HTAP product, and supports core internal business. To improve disaster tolerance, dual-data-center deployment is supported: each newly onboarded business can deploy two AP clusters across the two data centers. When a serious failure occurs in the online cluster, traffic can be quickly switched to the other cluster through a Proxy, improving service availability.

Challenges in business adoption

Flink is very mature in streaming scenarios and its use in batch scenarios is expanding steadily, but it is less polished and less used in OLAP scenarios. ByteDance Flink OLAP ran into many problems and challenges during real business adoption, mainly around performance and around operations, maintenance, and stability.

A major performance challenge is that OLAP workloads require sub-second job latency, which is very different from streaming and batch. Streaming and batch jobs mainly care about data processing throughput and do not need to worry about the time spent on plan construction and task initialization, but in OLAP scenarios optimizing these stages becomes critical. In addition, ByteDance Flink OLAP is built on a storage-compute separation architecture, which creates a stronger need for operator pushdown.

Another challenge is that OLAP workloads require high QPS. When an OLAP cluster frequently creates and executes jobs, it can in some cases develop serious performance problems, whereas in streaming and batch mode a job is executed only once and such problems rarely occur. To address these differences, many query-related optimizations were carried out for OLAP scenarios, such as accelerating plan construction and job initialization.

OLAP also differs greatly from streaming and batch in day-to-day operation: O&M, monitoring, and stability all need to be built specifically for OLAP scenarios.

In terms of operations and maintenance, OLAP is an online service with high availability requirements, so it is essential to improve the test process and test scenarios to reduce the probability of online bugs. In addition, unlike streaming and batch jobs, which can simply be restarted for an upgrade, upgrading an OLAP cluster must not interrupt users, so achieving a transparent upgrade is a challenge.

In terms of monitoring, to keep the online service available, faults in the online cluster must be located and recovered from promptly, so an OLAP-specific monitoring system is particularly important. Beyond the cluster status monitoring shared with streaming and batch, OLAP needs additional slow-query analysis and monitoring.

In terms of stability, the first challenge is building OLAP disaster recovery. The fault recovery strategies of streaming, batch, and OLAP differ: streaming jobs recover through failover, and batch jobs recover through job rerun or failover. Under OLAP, many jobs run on one online cluster at the same time; a single failed job can simply be retried, but if the whole cluster hits an unrecoverable failure, a restart-based recovery takes minutes, which is unacceptable for an online service. The second challenge is Full GC governance. Streaming and batch jobs tolerate Full GC relatively well, but OLAP workloads are very latency-sensitive, and a Full GC also slows down the other jobs running at the same time, seriously hurting the user experience.

 

2. Query optimization

Query Optimizer optimization

Plan cache

Queries in OLAP scenarios have two typical characteristics: the business issues repetitive queries, and query latency is sub-second. Analysis showed that the plan phase takes tens to hundreds of milliseconds, a relatively high share of the total, so plan caching is supported to avoid re-planning identical queries. A Catalog cache is also supported to speed up access to metadata, and parallel translation of ExecNodes is supported, which reduces TPC-DS plan time by about 10%.
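Conceptually, a plan cache can be as simple as an in-process LRU map keyed by the normalized query text. The sketch below is a minimal illustration under that assumption; PlanCache and its methods are hypothetical names, not the actual Gateway implementation.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** A minimal plan-cache sketch: LRU map from normalized SQL text to a planned result. */
public final class PlanCache<P> {

    private final int maxEntries;
    // Access-ordered LinkedHashMap: the eldest entry is evicted once maxEntries is exceeded.
    private final Map<String, P> cache;

    public PlanCache(int maxEntries) {
        this.maxEntries = maxEntries;
        this.cache = new LinkedHashMap<String, P>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, P> eldest) {
                return size() > PlanCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached plan for this query, invoking the planner only on a cache miss. */
    public synchronized P getOrPlan(String currentCatalog, String currentDatabase,
                                    String sql, Function<String, P> planner) {
        String key = currentCatalog + "." + currentDatabase + "|" + sql.trim();
        return cache.computeIfAbsent(key, k -> planner.apply(sql));
    }
}

In practice the key would also need to cover anything that changes the plan, such as session configuration, which is omitted here for brevity.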

Operator pushdown

Under a storage-compute separation architecture, operator pushdown is a very important class of optimization. The core idea is to push operators down to the storage layer for computation wherever possible, greatly reducing the amount of scanned data and external I/O as well as the amount of data the Flink engine has to process, thereby significantly improving query performance.

TopN pushdown: in a typical business inside ByteDance, most queries fetch TopN data. With TopN pushdown, the Local SortLimit operator (the local TopN operator) is pushed down to the Scan node, and the TopN computation is ultimately performed in the storage layer, greatly reducing the amount of data read from storage. After this optimization, the amount of data read dropped by 99.9% and query latency for that business dropped by 90.4%. Pushdown of more operators, including Aggregate, Filter, and Limit, is also supported.
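To make the mechanism concrete, here is a minimal, self-contained sketch of how a pushed-down TopN could be encoded in the request a Scan node sends to the storage layer. ScanRequest and SortKey are hypothetical names, not ByteHTAP's actual storage API; in open-source Flink, the connector-side hooks for this kind of pushdown are the Supports*PushDown source ability interfaces such as SupportsLimitPushDown and SupportsFilterPushDown.

import java.util.List;

/** A hypothetical scan request carrying pushed-down projection, filter, and TopN. */
public final class ScanRequest {
    public final String table;
    public final List<String> projectedColumns; // projection pushdown
    public final String filterExpression;       // filter pushdown, e.g. "id > 1"
    public final SortKey sortKey;               // TopN pushdown: sort column + direction
    public final long limit;                    // TopN pushdown: rows to keep per split

    public ScanRequest(String table, List<String> projectedColumns,
                       String filterExpression, SortKey sortKey, long limit) {
        this.table = table;
        this.projectedColumns = projectedColumns;
        this.filterExpression = filterExpression;
        this.sortKey = sortKey;
        this.limit = limit;
    }

    public record SortKey(String column, boolean descending) {}

    public static void main(String[] args) {
        // SELECT item_id, cnt FROM t ORDER BY cnt DESC LIMIT 10
        // With TopN pushed down, storage returns at most 10 (sorted) rows per split,
        // and Flink only merges these partial results.
        ScanRequest topN = new ScanRequest(
                "t", List.of("item_id", "cnt"), null, new SortKey("cnt", true), 10);
        System.out.println(topN.table + " limit=" + topN.limit);
    }
}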

Pushdown of common operators across Union All: data for one business inside ByteDance is stored in typical sharded databases and tables. In this scenario, users who need to query the full data set perform a Union All over multiple tables before computing. Flink's planner currently lacks support for pushing common operators down across Union All, so such queries read a large amount of data from the sources and consume significant resources processing it, leading to high resource consumption and E2E latency. Pushdown of common operators across Union All was therefore added, covering the Aggregate, SortLimit, and Limit operators.

Taking Aggregate as an example, the figure shows that before the optimization the Local Aggregate node sits downstream of the Union All node. Because the current Flink planner does not support pushing operators across Union All, the Local Aggregate node cannot be moved to the input side of the Union All, let alone further down into the Scan nodes, so a large amount of data is read from storage. After the optimization, the Local Aggregate node is pushed into each input branch of the Union All and ultimately down into storage for computation. With this optimization, E2E latency of the business queries dropped by 42% and the CPU consumption of the Flink cluster dropped by 30%.

Join Filter transfer

Many online business queries contain joins, most of them equi-joins with a filter condition. Because the filter on one side of the join is not passed to the other side, a large amount of data is scanned, which in turn hurts query performance.

Join Filter transfer is therefore supported. As shown in the figure above, the filter t1.id > 1 on table t1 can be used to derive t2.id > 1 through the equi-join condition t1.id = t2.id, so the derived filter can be pushed to the upstream of the t2 Scan node. Because filter pushdown is also supported, t2.id > 1 is ultimately pushed down into storage, greatly reducing the data read by the t2 Scan node and improving query performance.
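A minimal sketch of the derivation step follows, with hypothetical names: a single-column comparison on one join input is rewritten onto the other input through the equi-join key mapping, e.g. t1.id > 1 with t1.id = t2.id yields t2.id > 1.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating join-filter transfer across equi-join keys. */
public final class JoinFilterTransfer {

    /** A simple "column <op> literal" predicate, e.g. id > 1. */
    public record Predicate(String column, String op, String literal) {
        @Override public String toString() { return column + " " + op + " " + literal; }
    }

    /**
     * @param leftFilters  filters already known on the left input
     * @param equiJoinKeys mapping from left join column to right join column
     * @return filters that can additionally be applied on the right input
     */
    public static List<Predicate> transferToRight(List<Predicate> leftFilters,
                                                  Map<String, String> equiJoinKeys) {
        List<Predicate> derived = new ArrayList<>();
        for (Predicate p : leftFilters) {
            String rightColumn = equiJoinKeys.get(p.column());
            if (rightColumn != null) {
                derived.add(new Predicate(rightColumn, p.op(), p.literal()));
            }
        }
        return derived;
    }

    public static void main(String[] args) {
        List<Predicate> derived = transferToRight(
                List.of(new Predicate("t1.id", ">", "1")),
                Map.of("t1.id", "t2.id"));
        System.out.println(derived); // [t2.id > 1]
    }
}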

Query Executor Optimization

Classloader reuse optimization

While the online clusters were running continuously, we found that the JM and TM processes frequently create classloaders, resulting in high CPU usage. Flame-graph analysis showed that JVM Dictionary::find accounts for more than 70% of the CPU. Further analysis of the JVM source code shows that after the JVM loads a class, it maintains a hash table named SystemDictionary to speed up the lookup from class name to classloader. When the number of classloaders is very large, this hash table suffers many collisions, lookups become very slow, and most of the JM's CPU is spent in this step.

These classloaders turned out to be UserCodeClassloaders, which are used to dynamically load the user's JAR packages. As the figure shows, both the JobMaster of a new job and the job's tasks on the TM create a new UserCodeClassloader, resulting in far too many classloaders on JM and TM. Too many classloaders also exhaust JVM Metaspace and frequently trigger Metaspace Full GC.

Classloader reuse is therefore supported, in two steps. First, the way JAR dependencies are handled is optimized: because the third-party JARs that OLAP jobs depend on are relatively fixed, they can be placed directly on the classpath with which JM and TM start, so individual jobs no longer need to submit JARs. Second, when the JobMaster and Tasks of each job are initialized, the system classloader is reused directly. After this optimization, the CPU usage of Dictionary::find on the JM dropped from 76% to 1%, and the frequency of Metaspace Full GC dropped significantly.
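A minimal sketch of the reuse decision, with hypothetical names: when a job ships no extra JARs because its dependencies are pre-installed on the JM/TM classpath, the shared system classloader is returned instead of a fresh per-job classloader.

import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical factory illustrating classloader reuse in OLAP sessions. */
public final class TaskClassLoaderFactory {

    /**
     * @param jobJarUrls             JARs submitted with the job; empty in OLAP mode because
     *                               the dependencies are pre-installed on the classpath
     * @param reuseSystemClassLoader true for OLAP sessions
     */
    public static ClassLoader forJob(URL[] jobJarUrls, boolean reuseSystemClassLoader) {
        if (reuseSystemClassLoader || jobJarUrls.length == 0) {
            // One shared classloader per process: no per-job SystemDictionary growth,
            // no per-job Metaspace consumption.
            return ClassLoader.getSystemClassLoader();
        }
        // Fallback: the classic per-job user-code classloader.
        return new URLClassLoader(jobJarUrls, ClassLoader.getSystemClassLoader());
    }
}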

Codegen cache optimization

In OLAP scenarios, compiling Codegen source code accounts for a high share of TM CPU and a lot of time. To avoid repeated compilation, the existing Codegen cache maps the class name of the Codegen source to the classloader used by the Task, and from there to the compiled class, which alleviates the problem to some extent. But this caching mechanism still has two obvious problems:

  • The current mechanism only reuses compiled classes across the parallel instances of the same Task within one job; when the same query is executed multiple times, the code is still recompiled;
  • Every compilation and class load creates a new ByteArrayClassloader. Frequent classloader creation causes severe Metaspace fragmentation and triggers Metaspace Full GC, which in turn causes latency jitter in the service.

To avoid recompiling code across jobs and to share classes across jobs, the cache logic needs to be changed so that identical source code maps to the same compiled class. There are two difficulties:

How to ensure that operators with the same logic generate the same code?

When Codegen code is generated, the auto-incremented IDs used in class names and variable names are switched from a global counter to a counter scoped to the local context, so that operators with the same logic generate identical source code.
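A minimal sketch of this idea, with a hypothetical class name: the name counter is owned by the per-operator code generation context rather than a global static, so regenerating the same operator yields the same class and variable names.

import java.util.HashMap;
import java.util.Map;

/** Hypothetical context-scoped name allocator for generated code. */
public final class CodegenNameAllocator {

    // One counter per name prefix, scoped to this context (one context per generated operator).
    private final Map<String, Integer> counters = new HashMap<>();

    public String newName(String prefix) {
        int id = counters.merge(prefix, 1, Integer::sum);
        return prefix + "$" + id;
    }
}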

How to design the cache key to uniquely identify the same code?

A four-tuple of the classloader's hash value, the class name, the code length, and the MD5 of the code is used as the cache key to uniquely identify the same piece of code.
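A minimal sketch of such a key is shown below; CodegenCacheKey is a hypothetical name and the exact fields of the internal implementation may differ, but the four components match the description above.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical cache key: classloader hash + class name + code length + MD5 of the code. */
public final class CodegenCacheKey {

    private final int classLoaderHash;
    private final String className;
    private final int codeLength;
    private final byte[] codeMd5;

    public CodegenCacheKey(ClassLoader classLoader, String className, String code) {
        this.classLoaderHash = System.identityHashCode(classLoader);
        this.className = className;
        this.codeLength = code.length();
        this.codeMd5 = md5(code);
    }

    private static byte[] md5(String code) {
        try {
            return MessageDigest.getInstance("MD5").digest(code.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CodegenCacheKey)) return false;
        CodegenCacheKey that = (CodegenCacheKey) o;
        return classLoaderHash == that.classLoaderHash
                && codeLength == that.codeLength
                && className.equals(that.className)
                && Arrays.equals(codeMd5, that.codeMd5);
    }

    @Override
    public int hashCode() {
        return Objects.hash(classLoaderHash, className, codeLength) * 31 + Arrays.hashCode(codeMd5);
    }
}

Two generated classes with the same key can then share one compiled Class object across jobs instead of being compiled and loaded again.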

The effect of the Codegen cache optimization is very clear: CPU spent on code compilation on the TM side dropped from 46% to 0.3%, E2E query latency dropped by 29.2%, and Metaspace Full GC time dropped by 71.5%.

Deserialization optimization

While optimizing Task deployment performance, flame graphs showed relatively high CPU usage in the initialization phase of TM Tasks. Further analysis found that when Task deployment information is deserialized, the multiple subtasks of the same Task deserialize it redundantly. The deployment information, TaskInfo, mainly contains the head operator and the chained operators, which are serialized into SerializedUDF and ChainedTaskConfig inside TaskInfo during job construction. To reduce the redundant deserialization, there are two optimization directions:

The first targets the nested serialization structure of the chained operators: unnecessary serialization and deserialization of the Map structure is removed, so that multiple subtasks of the same Task can reuse the same deserialized Map.

The second targets Codegen operators. Initializing these widely used operators also carries a high deserialization overhead. Analysis showed that their deployment information mainly consists of the Codegen source code, yet every subtask on a TM deserializes the same source code once, which is highly redundant. The Codegen source code is therefore split out, deserialized once, and shared by all subtasks.
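A minimal sketch of the sharing idea, with hypothetical names: the first subtask of a Task on a TM pays the deserialization cost, and later subtasks reuse the result.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical per-TM cache that shares a deserialized payload across subtasks of one Task. */
public final class SharedDeserializationCache {

    // Keyed by (jobId, taskName); the value is the already-deserialized object.
    private static final Map<String, Object> SHARED = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public static <T> T getOrDeserialize(String jobId, String taskName, Supplier<T> deserializer) {
        String key = jobId + "#" + taskName;
        return (T) SHARED.computeIfAbsent(key, k -> deserializer.get());
    }

    /** Called when the last subtask of the task finishes, to avoid leaking entries. */
    public static void release(String jobId, String taskName) {
        SHARED.remove(jobId + "#" + taskName);
    }
}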

The effect of these deserialization optimizations is also very clear: with three subtasks per Task, the serialization and deserialization QPS of TaskInfo improved by 102% and 163% respectively.

Other optimizations

  • Join Probe early output: in the probe phase, Probe / Full Outer Hash Join can emit results early based on the Bloom filter built on the build side, reducing the amount of probe-side data spilled to disk and thereby improving performance.
  • Memory pooling: when an operator starts, it requests memory from Managed Memory and initializes the memory segments. In OLAP scenarios this accounts for a large share of time and resources, so a Cached Memory Pool is supported: the pool is shared within a TM, and operators no longer need to initialize memory at startup (see the sketch after this list).
  • Memory usage optimization: when queries containing many Aggregate/Join operators run at high parallelism, TM managed memory usage is very high even when the data volume is tiny. Investigation showed that operators that use Managed Memory request it with a step size of 16 MB, so every parallel instance of such operators requests at least 16 MB, leading to low actual memory utilization. A configurable step size with a smaller default is therefore supported, saving a large amount of memory.
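As referenced in the memory pooling item above, here is a minimal sketch of a TM-level cached memory pool, with hypothetical names: segments released by finished operators are kept in a shared free list and handed to newly started operators, so short queries skip per-operator memory initialization.

import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical TM-wide pool of reusable memory segments. */
public final class CachedMemoryPool {

    private final int segmentSize;
    private final Deque<byte[]> freeSegments = new ArrayDeque<>();

    public CachedMemoryPool(int segmentSize) {
        this.segmentSize = segmentSize;
    }

    /** Hands out a pooled segment if available, otherwise allocates a new one. */
    public synchronized byte[] allocate() {
        byte[] segment = freeSegments.poll();
        return segment != null ? segment : new byte[segmentSize];
    }

    /** Operators return segments here instead of releasing them to the JVM. */
    public synchronized void release(byte[] segment) {
        if (segment.length == segmentSize) {
            freeSegments.push(segment);
        }
    }
}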

 

3. Cluster operations, maintenance, and stability

Improving the O&M system

Building the O&M release process: after thorough testing, an automated pipeline releases all upstream and downstream dependent components in a unified way and finally upgrades the online clusters smoothly.

Improving the test methods: CI, correctness tests, performance tests, and long-running stability and fault-injection tests are supported. CI detects UT failures in time; the correctness tests use the query-rich TPC-DS test set; the performance tests mainly cover TPC-H and scheduling QPS. Because the online service has high stability requirements, long-running stability and fault-injection tests are also supported: with the service running for a long time while various faults are injected, we can check whether the cluster state and the results of the test queries match expectations. The fault tests cover a rich set of scenarios, including abnormal SQL, JM/TM exits, and network failures, which helps find problems such as memory leaks and improves service stability.

Smooth upgrades of online clusters: SQL Gateway rolling upgrades are supported. Concretely, a Flink cluster with the new version is started first, and then the multiple online Gateway instances are switched to the new cluster one by one, achieving a transparent upgrade and reducing the service interruption time from about 5 minutes to close to zero. During the rolling switch, a small portion of traffic is used for verification, and a fast rollback is possible once a problem is found, reducing the risk of going live.

Improving the monitoring system

Beyond the monitoring shared with streaming and batch clusters, such as CPU and other resource usage and process status such as GC time, fine-grained CPU monitoring was added to determine whether the cluster is CPU-bound under short queries. Query load monitoring was also added to judge whether the business load and the load on the Flink cluster are normal.

In addition to cluster monitoring, OLAP-specific job monitoring was added, together with full-link latency breakdowns, making it easy to quickly locate which stage of a slow query is the bottleneck, such as Parse, Optimize, or job execution. More monitoring of slow queries and failed queries, as well as of dependent external I/O, was also added.

Stability Governance

As an online service, Flink OLAP has high stability requirements, but in the early stage of adoption, online stability was poor due to problems such as missing disaster recovery and frequent JM/TM Full GC. We carried out governance in four areas: HA, rate limiting, GC optimization, and JM stability.

  • HA: dual-data-center hot standby is supported to improve the availability of the online service; with dual-data-center disaster recovery in place, recovery is achieved quickly by switching traffic. In addition, JM HA removes the JM single point of failure and further improves availability.
  • Rate limiting and circuit breaking: streaming and batch jobs do not need rate limiting, but in OLAP scenarios users submit queries continuously. To keep query spikes from bringing the cluster down, QPS limiting in the SQL Gateway is supported (a minimal sketch follows this list). To avoid high JM and TM load and slow queries caused by too many jobs running at the same time, the maximum number of jobs the Flink cluster may run is also limited. Beyond rate limiting, a fail-fast failover strategy is used under OLAP to prevent failed jobs from piling up and causing a cluster avalanche.
  • GC optimization: OLAP workloads are very latency-sensitive and Full GC causes latency jitter, so the Full GC behavior of JM and TM was optimized. First, Task/Operator-level metrics were removed, reducing the JM Full GC frequency by 88%. Second, the Codegen cache optimization reduced the number of TM Metaspace Full GCs to close to zero.
  • JM stability improvement: in OLAP scenarios, the JobMaster's ZooKeeper dependency was removed, because it causes job latency jitter under high QPS. The number of jobs displayed in the Flink UI is limited, because continuously submitting a large number of jobs would otherwise bloat JM memory and hurt JM stability. The Flink UI's automatic refresh is also turned off, to avoid page freezes caused by the extra JM load from refreshing.
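As referenced in the rate limiting item above, here is a minimal sketch of Gateway-side admission control, with hypothetical names; it implements a simple concurrency cap with fast rejection, which is one straightforward way to realize the QPS and job-count limits described above, not necessarily the production implementation.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical admission control: caps concurrently running queries and rejects quickly when busy. */
public final class GatewayAdmissionControl {

    private final Semaphore runningQueries;
    private final long acquireTimeoutMs;

    public GatewayAdmissionControl(int maxConcurrentQueries, long acquireTimeoutMs) {
        this.runningQueries = new Semaphore(maxConcurrentQueries);
        this.acquireTimeoutMs = acquireTimeoutMs;
    }

    /** Returns true if the query may be submitted; the caller must call release() afterwards. */
    public boolean tryAdmit() throws InterruptedException {
        return runningQueries.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Called when the query finishes or fails fast, so failed jobs do not pile up. */
    public void release() {
        runningQueries.release();
    }
}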

 

4. Benefits

Benchmark benefits: with the Query Optimizer and Query Executor optimizations above, query latency in the TPC-H 100G benchmark dropped by 50.1%. An E2E benchmark was also run on three kinds of small-data queries of different complexity (source-sink, a more complex WordCount, and a more complex three-table join), with a very clear effect: average E2E QPS increased by 25 times, and average E2E latency dropped by 92%, a reduction of more than 10x.

Business benefits: clear improvements in both performance and stability. On performance, average job latency dropped by 48.3% and average TM CPU by 27.3%. On stability, the JM Full GC frequency dropped by 88% and TM Full GC time by 71.5%.

 

5. Future plans

  • Better productization: including History Server support and intelligent analysis of slow queries.
  • Vectorized engine: make full use of CPU parallelism to improve computing performance.
  • Materialized views: for computations over large amounts of data, querying and computing on the fly is very expensive in time and resources, so we are considering materialized views to speed up user queries and save resources.
  • Optimizer evolution: follow the latest developments in industry and academia, such as learning-based SQL optimization in AI4DB.

 
