BigData - What exactly is 'cost-based optimization'?

This article is published by NetEase Cloud.

 

The previous article discussed an optimization of the basic Join algorithm, the Runtime Filter, and at the end it also touched on predicate pushdown. At the beginning of that article, the author raised two questions: how does the SQL execution engine know the sizes of the two data sets participating in a Join? And is that size measured in physical bytes, in number of records, or both? The answers determine how the SQL optimizer correctly chooses a Join algorithm. That, then, is the topic of this article: Cost-Based Optimization (CBO for short).

 

CBO Fundamentals

 

When it comes to CBO, one has to mention an old acquaintance: Rule-Based Optimization (RBO for short). RBO is an empirical, heuristic optimization approach: the optimization rules are defined in advance, and a SQL query simply has to be matched against them. To put it bluntly, RBO is like an experienced old driver who knows all the standard routines.

 

However, some things in the world simply don't follow a routine, or rather, no routine exists for them. The most typical case is the optimization of complex Join operators, which usually involves two multiple-choice questions:

 

1. Which algorithm should be used to execute the Join: BroadcastJoin, ShuffleHashJoin, or SortMergeJoin? Different execution strategies place different resource demands on the system, and their execution efficiency differs enormously. For the same SQL, choosing the right strategy may mean the query finishes in a few seconds, while choosing the wrong one may cause the system to OOM.

2. For a snowflake or star schema, in what order should a multi-table Join be executed? Different Join orders mean different execution efficiency. Take A join B join C as an example, where tables A and B are very large and table C is very small. A join B obviously requires a lot of system resources and will not finish quickly. If the order A join C join B is used instead, A join C returns quickly because C is small, the intermediate result set is small, and joining that small result set with B clearly performs better than the previous plan.

 

Think about it: are there fixed optimization rules for problems like these? Not at all. To put it bluntly, you need more basic information about the tables (table size, total number of records, and so on) and then select an optimal execution plan through some kind of cost evaluation. CBO, the cost-based optimization strategy, selects the syntax tree with the lowest cost from all possible syntax trees and executes it. In other words, the core of CBO is evaluating the actual cost of a given syntax tree, for example the following SQL syntax tree:

 

 

To evaluate the cost of a whole tree, simply divide and conquer: evaluate the execution cost of each node, then add up the costs of all nodes. To evaluate the actual execution cost of a single node, two things are needed. The first is the cost rule of the operator; each operator has its own cost calculation formula, and Merge-Sort Join, Shuffle Hash Join, and GroupBy all have their own cost algorithms. The second is the basic information of the data sets participating in the operation (size, total number of records); for example, the sizes of the two tables actually fed into a Merge-Sort Join are obviously an important factor in the node's actual execution cost. Just imagine: the same Table Scan operation has very different costs on a large table and on a small table.
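
As a rough illustration of this divide-and-conquer idea, here is a minimal sketch in Python; the operator names, cost rules, and numbers are made up for illustration and are not any engine's real cost model.

# Minimal sketch: total plan cost = sum of per-node costs (illustrative only;
# the operator names and cost rules below are hypothetical, not a real engine's).

class PlanNode:
    def __init__(self, operator, children=None, stats=None):
        self.operator = operator        # e.g. "TableScan", "Filter", "HashJoin"
        self.children = children or []  # child plan nodes
        self.stats = stats or {}        # basic info of this node's output: rows, bytes

def node_cost(node, cost_rules):
    """Cost of executing this single node, given per-operator cost rules."""
    return cost_rules[node.operator](node.stats)

def tree_cost(node, cost_rules):
    """Divide and conquer: cost of a subtree = this node's cost + its children's costs."""
    return node_cost(node, cost_rules) + sum(tree_cost(c, cost_rules) for c in node.children)

# Hypothetical cost rules keyed by operator name.
cost_rules = {
    "TableScan": lambda s: s["bytes"] * 1.0,                  # IO-dominated
    "Filter":    lambda s: s["rows"] * 0.1,                   # CPU per row
    "HashJoin":  lambda s: s["rows"] * 0.5 + s["bytes"] * 0.2,
}

scan_a = PlanNode("TableScan", stats={"rows": 1_000_000, "bytes": 100e6})
filt_a = PlanNode("Filter", [scan_a], stats={"rows": 10_000, "bytes": 1e6})
scan_b = PlanNode("TableScan", stats={"rows": 50_000, "bytes": 5e6})
join   = PlanNode("HashJoin", [filt_a, scan_b], stats={"rows": 10_000, "bytes": 2e6})

print(tree_cost(join, cost_rules))  # total cost of this candidate syntax tree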

 

Evaluating the cost of a given operator is, in the end, just an algorithm, and algorithms are fixed; they are described in detail below, so they are not listed here. The basic information of the participating data sets, on the other hand, is "alive". Why? Because these data sets are intermediate results produced after the original tables have been filtered and aggregated, and no rule directly tells you how much data an intermediate result contains. So how can the basic information of an intermediate result be evaluated? Derive it! We do know the basic information of the original tables, and if it can be derived layer by layer, the information of any intermediate result can be obtained.

 

Here, evaluating the intermediate-result information of any node splits into two sub-problems: first evaluate the basic information of the leaf nodes (the original tables), then derive it upward layer by layer. There is always a way to evaluate the basic information of an original table; in the crudest case, scan the whole table to obtain the number of records, the maximum value, and the minimum value. In short, it can be done. And how can the basic information be derived layer by layer? Rules! For example, the information (size and so on) of the data set produced after an original table passes through the filter id = 12 can be derived by certain rules. Different operators have different rules, which are described in detail below.

 

1. The principle of cost-based optimization (CBO) is to calculate the cost of all execution paths and select the one with the lowest cost. The problem becomes: how to calculate the cost of a given execution path;

2. To calculate the cost of a given path, it is only necessary to calculate the execution cost of each node on that path and add them up. The problem becomes: how to calculate the execution cost of any single node;

3. To calculate the execution cost of any node, you only need the cost calculation rule of that node's operator and the basic information of the data set (intermediate result) participating in the calculation (data size, number of records, and so on). The problem becomes: how to calculate the basic information of the intermediate results and how to define the operators' cost calculation rules;

4. Operator cost calculation rules are fixed and can simply be defined. The basic information of any intermediate result must be derived layer by layer along the syntax tree from the basic information of the original tables. The problem becomes: how to collect the basic information of the original tables and how to define the derivation rules.

 

Obviously, the above is a thought process. Real engineering practice works step by step from the bottom up and finally arrives at the execution path with the lowest cost. Assembling the parts back into the whole:

 

1. First, collect the basic information of the original tables;

2. Then define the cardinality derivation rules of each operator, that is, how a data set's basic information changes after the operator is executed. Once these two steps are complete, the basic information of every intermediate result set on the entire execution plan tree can be derived;

3. Define the execution cost of each operator; combined with the basic information of the intermediate result sets, the execution cost of any node can now be obtained;

4. Accumulate the costs of all operators on a given execution path to obtain the cost of the entire syntax tree;

5. Compute the costs of all possible syntax trees and select the one with the lowest cost.

 

Basic implementation ideas of CBO

 

The section above analyzed the idea behind CBO at a theoretical level and split the complete CBO function into several sub-functions. Now let's look at how each sub-function is implemented.

 

Step 1: Collect the basic information of the original tables

 

This is the most basic work of CBO. The information collected mainly includes table-level metrics and column-level metrics. As listed below, estimatedSize and rowCount are table-level statistics, while basicStats and Histograms are column-level statistics; the latter are finer-grained and more important for optimization. (A small sketch of these structures follows the list.)

 

estimatedSize: the size of the output data of each LogicalPlan node (after decompression)

rowCount: the total number of output rows of each LogicalPlan node

basicStats: basic column statistics, including column type, Max, Min, number of nulls, number of distinct values, max column length, average column length, etc.

Histograms: column histograms, i.e., an equi-width histogram (for numeric and string types) and an equi-height histogram (numeric types only).
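
As a concrete, simplified illustration, the collected statistics could be held in structures like the following sketch; the field names mirror the list above, but the classes themselves are hypothetical and not any engine's actual metadata format.

# Simplified sketch of table-level and column-level statistics. The field names
# follow the list above; the classes are illustrative, not a real engine's API.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ColumnStats:                                 # "basicStats" in the text
    data_type: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    null_count: int = 0
    distinct_count: int = 0
    max_len: int = 0
    avg_len: float = 0.0
    histogram: Optional[List[float]] = None        # e.g. bucket heights

@dataclass
class TableStats:
    estimated_size: int = 0                        # output size in bytes (decompressed)
    row_count: int = 0                             # total number of output rows
    columns: Dict[str, ColumnStats] = field(default_factory=dict)

# Example: statistics collected for a table C with a numeric column c_id.
stats_c = TableStats(
    estimated_size=2_000_000_000,
    row_count=10_000_000,
    columns={"c_id": ColumnStats("bigint", min_value=1, max_value=1_000_000,
                                 distinct_count=1_000_000)},
)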

 

There are two questions to consider here:

 

1. Why collect this information? What role does each item play in the optimization process?

2. How do real projects typically collect these statistics?

 

Why collect this information? Obviously, estimatedSize and rowCount are the most direct inputs to operator cost evaluation: the larger these two values are, the higher the execution cost of a given operator, so they will later be used to evaluate the actual execution cost of operators. What are basicStats and Histograms for? Don't forget the original intention: the whole point of collecting information about the original tables is to derive, along the execution syntax tree, the basic information of every intermediate result, and these two exist precisely for that purpose. How they are used is explained with an example in the next section.

 

How do real projects collect these statistics? There are generally two feasible approaches: scan every table in full, which is the simplest and gives accurate statistics but is expensive; or sample the data and compute the statistics from the sample, which is cheaper but less accurate.

 

Systems that support CBO provide commands for collecting statistics on the raw data, such as Hive's Analyze command, Impala's Compute Stats command, and Greenplum's Analyze command. Note, however, that these commands should not be run at arbitrary times: first, there is no need to run them if the table data has not changed much; second, they should not be run during peak query periods. Here is a best practice: during off-peak business hours, run the statistics command only on the tables whose data has changed significantly. There are three key points packed into that sentence; can you spot them?

 

Step 2: Define the cardinality derivation rules of the core operators

 

A derivation rule is a rule for calculating the statistics of a parent node from the statistics of its child node. Different operators necessarily have different derivation rules; the estimation rules for filter, group by, limit, and so on are all different. Here we take filter as an example. First look at this SQL: select * from A, C where A.id = C.c_id and C.c_id > N. The syntax tree after RBO is shown in the following figure:

The problem is defined as follows: given the basic statistics of table C (estimatedSize, rowCount, basicStats, and histograms), how do we derive the basic statistics of the intermediate result after the filter C.c_id > N is applied? Let's see:

 

1. Assume that the minimum value c_id.Min, the maximum value c_id.Max, and the number of distinct values c_id.Distinct of column c_id are known, and assume the data is uniformly distributed, as shown in the following figure:

 

2. Three cases need to be considered: N is less than c_id.Min, N is greater than c_id.Max, and N lies between c_id.Min and c_id.Max. The first two are special cases of the third, so only the third is briefly described here, as shown below:

Under the filter condition C.c_id > N, c_id.Min increases to N while c_id.Max stays unchanged, and the predicate's selectivity scales the count after filtering: c_id.Distinct(after_filter) = (c_id.Max - N) / (c_id.Max - c_id.Min) * c_id.Distinct(before_filter). The row count is scaled by the same factor.
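
A minimal sketch of this derivation rule under the uniform-distribution assumption (the function and variable names are illustrative, not an engine's actual rule):

def derive_greater_than_filter(min_val, max_val, distinct_before, rows_before, n):
    """Estimate column stats and cardinality after the filter `col > n`,
    assuming values are uniformly distributed between min_val and max_val."""
    if n < min_val:                       # predicate keeps everything
        return min_val, max_val, distinct_before, rows_before
    if n >= max_val:                      # predicate removes everything
        return None, None, 0, 0
    selectivity = (max_val - n) / (max_val - min_val)
    new_min = n                           # Min rises to N, Max is unchanged
    return new_min, max_val, distinct_before * selectivity, rows_before * selectivity

# c_id in [1, 1_000_000], 1_000_000 distinct values, 10_000_000 rows, filter c_id > 750_000
print(derive_greater_than_filter(1, 1_000_000, 1_000_000, 10_000_000, 750_000))
# roughly a quarter of the distinct values and rows survive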

 

Simple enough, but note that the calculation above assumes a uniform data distribution, which real data almost never follows. Real data follows some distribution, and this is where histograms come in. A histogram is, to put it bluntly, a bucketed picture of how the column's values are distributed, as shown below:

 

The x-axis of the histogram represents the distribution of the column's values and the y-axis represents the frequency. Assuming N falls at the position shown in the figure, the count after filtering is c_id.Distinct(after_filter) = height(>N) / height(All) * c_id.Distinct(before_filter).
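
With a histogram, the derivation simply replaces the uniform selectivity with a ratio of bucket heights. A rough sketch, assuming an equi-width histogram represented as (bucket upper bound, height) pairs (again purely illustrative):

def selectivity_greater_than(histogram, n):
    """Fraction of rows with value > n, estimated from a histogram given as
    (bucket_upper_bound, height) pairs. A bucket whose upper bound exceeds n is
    counted in full; a real implementation would interpolate within the bucket
    that straddles n."""
    total = sum(height for _, height in histogram)
    above = sum(height for upper, height in histogram if upper > n)
    return above / total if total else 0.0

# Hypothetical histogram of c_id: four buckets with upper bounds 25, 50, 75, 100.
hist = [(25, 400), (50, 300), (75, 200), (100, 100)]
sel = selectivity_greater_than(hist, 50)   # height(>N) / height(All)
print(sel)                                 # 0.3, so rows_after = 0.3 * rows_before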

 

Of course, all the calculations above are only schematic; the real algorithms are much more complicated. If you are interested in the estimation rules for operators such as group by and limit, read the SparkSQL CBO design document; they are not repeated here. At this point, the basic statistics of every intermediate node in the syntax tree can be computed from the various estimation rules together with the original table statistics. This is the second step of the long march, and a crucial one. Next, let's see how to calculate the actual cost of each core operator.

 

Step 3: Calculate the actual execution cost of the core operators

Some operators are relatively simple to cost, others more complicated. This section focuses on the execution cost of individual nodes; as mentioned above, the total cost of an execution path is the sum of the costs of all nodes on that path.

 

Generally speaking, the actual execution cost of a node is defined along two dimensions: CPU cost and IO cost. For convenience in the explanation that follows, a few basic parameters need to be defined first:

 

Hr: the cost of reading 1 byte of data from HDFS

Hw: the cost of writing 1 byte of data to HDFS

Tr: the total number of tuples in the relation

Tsz: the average size of a tuple in the relation

CPUc: the CPU cost of one comparison, in nanoseconds

NEt: the average cost of transferring 1 byte over the network between any two nodes in the Hadoop cluster

……

 

As mentioned above, each operator's actual execution cost is calculated differently, and it is impossible to list them all here. Two relatively simple and easy-to-understand operators are analyzed instead: the Table Scan operator and the Hash Join operator.

 

Table Scan operator. The Scan operator generally sits at a leaf of the syntax tree. Intuitively, this operator only has IO cost, and its CPU cost is 0. Table Scan Cost = IO Cost = Tr * Tsz * Hr. Very simple: Tr * Tsz is the total size of the data to be scanned, and multiplying it by Hr gives the required cost. Very intuitive.
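
Expressed as code, the scan cost rule is a one-liner; the unit cost below is a placeholder, not a measured value:

HR = 1.0   # hypothetical cost of reading 1 byte from HDFS (placeholder value)

def table_scan_cost(tr, tsz, hr=HR):
    """Table Scan cost = IO cost = Tr * Tsz * Hr (CPU cost is taken as 0)."""
    return tr * tsz * hr

print(table_scan_cost(tr=10_000_000, tsz=200))   # 10M tuples of ~200 bytes each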

 

Hash Join operator. Take Broadcast Hash Join as an example (see the author's earlier article if you are not familiar with how Broadcast Hash Join works). Suppose the large table is distributed across n nodes; the number of tuples and the average tuple size on each node are Tr(R1), Tsz(R1); Tr(R2), Tsz(R2); …; Tr(Rn), Tsz(Rn); and the small table has Tr(Rsmall) tuples of average size Tsz(Rsmall). Then the CPU cost and IO cost are:

 

CPU Cost = cost of building the hash table from the small table + cost of probing it with the large table = Tr(Rsmall) * N * CPUc + (Tr(R1) + Tr(R2) + … + Tr(Rn)) * CPUc, where it is assumed that inserting a tuple into the hash table costs far more CPU than a simple comparison of two values, namely N * CPUc.

 

IO Cost = small table scan cost + small table broadcast cost + large table scan cost = Tr(Rsmall) * Tsz(Rsmall) * Hr + n * Tr(Rsmall) * Tsz(Rsmall) * NEt + (Tr(R1) * Tsz(R1) + … + Tr(Rn) * Tsz(Rn)) * Hr
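
Putting the two formulas together, a sketch of this Broadcast Hash Join cost rule might look like the following; the unit costs and the hash-build factor N are illustrative placeholders:

HR, NET, CPUC = 1.0, 4.0, 0.001   # placeholder unit costs for Hr, NEt, CPUc
N_BUILD = 10                      # assumed: building one hash entry ~ N comparisons

def broadcast_hash_join_cost(big_parts, small_tr, small_tsz):
    """big_parts: list of (Tr(Ri), Tsz(Ri)) for the large table on each of n nodes."""
    n = len(big_parts)
    big_rows = sum(tr for tr, _ in big_parts)
    big_bytes = sum(tr * tsz for tr, tsz in big_parts)
    small_bytes = small_tr * small_tsz

    cpu_cost = small_tr * N_BUILD * CPUC + big_rows * CPUC   # build + probe
    io_cost = (small_bytes * HR                  # scan the small table
               + n * small_bytes * NET           # broadcast it to all n nodes
               + big_bytes * HR)                 # scan the large table
    return cpu_cost + io_cost

# Large table split across 3 nodes; small table has 100k tuples of ~100 bytes.
print(broadcast_hash_join_cost(
    big_parts=[(5_000_000, 200), (4_000_000, 200), (6_000_000, 200)],
    small_tr=100_000, small_tsz=100))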

 

Obviously, the Hash Join operator is a bit more complicated than the Table Scan operator, but whichever operator you look at, the cost calculation depends directly on factors such as the total number of tuples involved and their average size. That is exactly why the first two steps work so hard to estimate the intermediate results in detail: every step builds on the previous one. With that in place, the actual cost of any node can be estimated, so computing the cost of any given execution path becomes trivial.

 

Step 4: Select the optimal execution path (the one with the minimum cost)

 

The idea here is easy to understand. After the three steps above, the cost of any given path can be computed. All that remains is to enumerate every feasible execution path, compute each one's cost, and pick the path with the lowest cost, which is the optimal execution path.

 

This last step looks easy but is actually not. Why? There are simply too many possible execution paths; by the time every path has been evaluated, it would be far too late. Is there a better way? Of course. In fact, the title of this step, choosing the least expensive execution path, should immediately bring dynamic programming to mind. If it doesn't, that only means you haven't read "The Beauty of Mathematics", ground through LeetCode, or competed in ACM. If ACM and LeetCode sound too dry, read "The Beauty of Mathematics": it explains how dynamic programming finds the shortest driving route from wherever you are to Beijing. The details are not repeated here.
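
To make the dynamic-programming idea a bit more concrete, here is a heavily simplified, Selinger-style sketch that finds the cheapest left-deep join order by building up over subsets of tables; the cost and cardinality functions are stand-ins, not any engine's real model:

from itertools import combinations

def best_join_order(tables, pair_cost):
    """tables: dict of table name -> estimated row count.
    pair_cost(left_rows, right_rows): stand-in cost of joining two inputs.
    Returns (total_cost, order) of the cheapest left-deep join order, computed
    bottom-up over subsets of tables (classic dynamic programming)."""
    names = sorted(tables)
    # For each subset: (cost so far, join order, crude output row estimate).
    best = {frozenset([t]): (0.0, [t], tables[t]) for t in names}

    for size in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, size)):
            candidates = []
            for t in subset:                        # t is the table joined last
                rest = subset - {t}
                rest_cost, rest_order, rest_rows = best[rest]
                cost = rest_cost + pair_cost(rest_rows, tables[t])
                out_rows = min(rest_rows, tables[t])    # crude output estimate
                candidates.append((cost, rest_order + [t], out_rows))
            best[subset] = min(candidates)
    cost, order, _ = best[frozenset(names)]
    return cost, order

# Hypothetical post-filter row counts echoing the earlier A join B join C example.
tables = {"A": 1_000_000, "B": 800_000, "C": 1_000}
print(best_join_order(tables, pair_cost=lambda left, right: left * right))
# joins the small table C early rather than A join B first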

 

So far, the author has briefly introduced how mainstream SQL engines implement, step by step, a seemingly advanced technology like CBO. Next, the author uses two major SQL engines, Hive and Impala, to show the optimization effect of CBO, so that everyone can get a more intuitive feel for it.

 

Hive CBO optimization effect

 

Hive does not implement a SQL optimizer from scratch; it relies on Apache Calcite, an open-source, CBO-based, enterprise-grade SQL query optimization framework. Projects including Hive, Phoenix, Kylin, and Flink currently use Calcite as their execution optimizer, which is easy to understand: the optimizer can be abstracted into a standalone module, so there is no need to spend a lot of time reinventing the wheel.

 

Hortonworks has tested Hive's CBO feature. The test results suggest that CBO has at least three important effects on queries: Join ordering optimization, Bushy Join support, and Join simplification. This article only briefly introduces Join ordering optimization; interested readers can continue with the article "Hive 0.14 Cost Based Optimizer (CBO) Technical Overview" to learn about the other two. (The data and diagrams below also come from that article, which is hereby acknowledged.)

 

select dt.d_year,
       item.i_brand_id brand_id,
       item.i_brand brand,
       sum(ss_ext_sales_price) sum_agg
from date_dim dt, store_sales, item
where dt.d_date_sk = store_sales.ss_sold_date_sk
  and store_sales.ss_item_sk = item.i_item_sk
  and item.i_manufact_id = 436
  and dt.d_moy = 12
group by dt.d_year, item.i_brand, item.i_brand_id
order by dt.d_year, sum_agg desc, brand_id
limit 10

 

The query above involves three tables: a fact table, store_sales (large), and two dimension tables (small). The relationship between the three tables is shown in the following figure:

 

This is exactly the Join-order problem mentioned above. Looking at the raw tables, date_dim has 73,049 records while item has 462,000, so without any other hints the Join order would be store_sales join date_dim join item. However, the where clause also contains two filter conditions, and CBO estimates the amount of data remaining after each filter. The results are as follows:

 

The table above shows that after filtering, item is actually much smaller than date_dim, quite a plot twist. With CBO, therefore, the Join order becomes store_sales join item join date_dim. For further confirmation, the SQL execution plans before and after enabling CBO can be recorded, as shown in the following figure:

 

The figure on the left is the execution plan of Q3 with CBO disabled: store_sales is first joined with date_dim, and the intermediate result after that join contains 14 billion rows. In the figure on the right, store_sales is first joined with item, and the intermediate result contains only 82 million rows. Obviously the latter executes more efficiently. Now look at the actual execution times of the two:

The figure above clearly shows that with CBO the performance of Q3 nearly doubles, while CPU usage also drops by about half. There are many similar queries in TPC-DS; interested readers can explore them further.

 

Impala CBO optimization effect

 

The principle is the same as in Hive: optimize the execution order of complex Joins and the choice of Join execution strategy. The author used TPC-DS to test the performance of several Impala queries before and after enabling the CBO feature; the results are shown in the following figure:

 

 

CBO Summary

 

This article was actually conceived quite a while ago and took nearly three months of on-and-off writing, with much written and then deleted. I remember having written a great deal of the second draft when, reading it through early one morning, I realized it was not what I wanted; to be precise, it lacked order, and it was hard to fix, so I simply deleted it. Another reason is that there are few complete introductions to CBO online; I found some English material, but it too felt disordered and hard to follow. The first section of this article introduces CBO from the perspective of how to think about it; the second walks through the whole mechanism step by step from the implementation perspective; the third uses Hive and Impala to show the optimization effect with CBO enabled, to give everyone a more intuitive feel.

 

Well, on the topic of Join, the author has now written three articles; making it this far can only be called true love! To be honest, the author has not read the RuntimeFilter code in full, nor systematically studied any CBO code implementation; the content comes mainly from three sources: official blogs and documentation, analysis and reflection, and hands-on practice. So please read critically; mistakes are inevitable, and discussion and corrections are welcome. The author will certainly read the relevant code later and share any new findings.

 

 


 
