MaxCompute's brain: a cost-based optimizer that understands data, computation, and users

Abstract: Looking back at the milestones of big data technology, the earliest can be traced to the official launch of Hadoop in 2006. Looking around the industry today, databases and data processing engines of every kind abound; this is a good era for technologists, with 300+ popular databases in the database field alone and an even richer set of big data processing technologies around the Hadoop ecosystem. At the Yunqi Community 2017 Online Technology Summit (Big Data Technology track), Alibaba Cloud Big Data Computing Platform architect Lin Wei gave a talk entitled "The Brain of MaxCompute: Cost-Based Optimizer", sharing the design and architecture of the cost-based optimizer, the brain of Alibaba's big data computing service.

For more content, see the big data channel of Yunqi Community at https://yq.aliyun.com/big-data. In addition, with MaxCompute and its supporting products, low-cost big data analysis takes only a few steps; for details, visit https://www.aliyun.com/product/odps.


Introduction to MaxCompute

MaxCompute (Big Data Computing Service) is a fast, fully managed data warehouse solution at PB/EB scale. It can scale to 10,000 servers with cross-region disaster recovery, and it is Alibaba's core internal big data platform, carrying most of the Group's computing workloads at a scale of millions of jobs per day. MaxCompute provides users with a complete data import solution and a variety of classic distributed computing models, so that users can solve massive data computing problems quickly, reduce enterprise costs effectively, and keep data secure.

MaxCompute Architecture


The basic architecture of MaxCompute is shown in the figure above. At the bottom is the Pangu distributed file system, built on physical machines to provide unified storage. Above Pangu sits the Fuxi distributed scheduling system, which manages all computing resources, including CPU, memory, network, and disk. The next layer is the unified execution engine, the MaxCompute execution engine. On top of the execution engine, various computing modes are built, such as stream computing, graph computing, offline processing, in-memory computing, and machine learning. Above that sits a layer of programming languages, the MaxCompute language. Data engineers can develop applications on this platform and have them deployed and run quickly in distributed settings.

R&D ideas of MaxCompute


The R&D ideas of MaxCompute fall into four main aspects:
High performance, low cost, and large scale. We hope the MaxCompute platform delivers high computing performance, lowers the cost of use for users as much as possible, and reaches the scale of 10,000 machines across multiple clusters.
Stability and service. We hope the MaxCompute platform offers a stable, service-oriented experience, so that users do not have to think about the difficulties of distributed applications and only need to focus on what computation they want to perform, while the system itself serves users through stable, service-oriented interfaces.
Ease of use for data developers. We hope the MaxCompute platform is easy to use and serves data development engineers well. It should not require them to understand distributed systems in depth; they only need to focus on what operations they want to perform on their data, and the platform helps them realize those ideas efficiently and at low cost.
Multiple functions. We hope MaxCompute offers rich functionality, not only supporting stream computing, graph computing, batch processing, and machine learning, but also supporting more types of computation well on the same platform.
MaxCompute's Brain: the Optimizer

Based on the above R&D ideas, the MaxCompute platform needs a more powerful brain. This brain needs to understand the user's data, the user's computation, and the user themselves, so that it can help users optimize their jobs more efficiently and figure out, at the system level, what the user actually wants to compute. This achieves the goals above: users are shielded from the distributed details and do not need to think about how to make their jobs run efficiently. That is the value MaxCompute brings to users.


So what exactly is the brain of MaxCompute? It is the optimizer. The optimizer connects all of this information: by understanding the relationships within the data and the user's intent, and by using the machine's ability to analyze the various environments, it executes the user's job in the most efficient way in a distributed setting. In this talk, offline computing is used as the main example to introduce the optimizer, the brain of MaxCompute.


First, a brief introduction to offline computing. The design of the MaxCompute offline computing architecture is shown in the figure above. At the computing level there is usually a scripting language similar to a high-level language; MaxCompute provides a SQL-like scripting language. The script is submitted through the FrontEnd, processed, and transformed into a logical execution plan. Under the guidance of the Optimizer, the logical execution plan is translated into a more efficient physical execution plan, which, through the connection with the Runtime, is decomposed by the Fuxi distributed scheduling system onto computing nodes for execution.

The core of this process is how to fully understand the user's query plan and obtain an efficient physical execution plan through optimization; the component that does this is the optimizer. At present, the optimizers of Hive and, to some extent, Spark in the open source community are essentially rule-based. In fact, on single-machine systems optimizers are likewise classified into two kinds: rule-based optimizers and cost-based optimizers.


In the single-machine world, Oracle used a rule-based optimizer from version 6 through 9i, introduced a cost-based optimizer in Oracle 8, and dropped the rule-based optimizer entirely in Oracle 10g. In the big data world, Hive started with only a rule-based optimizer; newer versions of Hive have begun to introduce a cost-based optimizer, but it is not yet a full cost-based optimizer. MaxCompute, on the other hand, uses a complete cost-based optimizer. So what is the difference between the two? A rule-based optimizer transforms plans by recognizing logical patterns: when a pattern is recognized, a rule fires and changes the execution plan from A to B. This approach is insensitive to the data, and the optimization is locally greedy, like a climber who only looks at the ten meters of uphill in front of him and never considers descending first in order to reach a higher peak. A rule-based optimizer therefore easily falls into plans that are locally optimal but globally poor, and the order in which rules are applied can produce different execution plans, so the result is often not optimal. A cost-based optimizer, by contrast, uses the Volcano model to enumerate the possible equivalent execution plans, computes the cost of each plan from statistics about the data, and finally chooses the plan with the lowest cost, which makes global optimality achievable.



Here is a concrete example of why a rule-based optimizer cannot achieve global optimality. The script in the figure above joins A, B, and C, then groups the join result on a column and computes an average. The query can be drawn as a tree-shaped logical execution plan, which in the database field is usually read bottom-up: the leaf nodes are the inputs, the root node is the final output, and data flows from the bottom to the top. In this logical plan the three tables A, B, and C are joined first. Assume Size(B) < Size(C) < Size(A), that is, B and C are both smaller than A. Then another plan is possible: join B and C first, then join the result with A, and finally compute the average. The intermediate result of joining B and C will probably be relatively small, and the final result after joining with A will also be relatively small. However, the average is later computed on column c2 of B, while the second join partitions its data by the join condition, that is, on column c1 of A and column c1 of B. After that join the data is partitioned on c1, but the GROUP BY is on c2 of B, so an exchange must be introduced, which means extra data shuffling. If the sizes of A, B, and C do not differ much, it can instead be better to join A and B first and then join with C, so that the second join is performed on column c2 of B; its output is then already partitioned on c2, and no extra data shuffling is needed when computing the average. Although the join cost is higher than in the first plan, one round of data shuffling is saved, so the plan is better from a global point of view. This example shows that a rule-based optimizer can reach a local optimum but may miss the global optimum.
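
To make the trade-off concrete, here is a minimal Python sketch with assumed table sizes and a deliberately crude cost formula (none of this reflects MaxCompute's actual cost model): the plan that joins B and C first produces a smaller intermediate result but pays for an extra exchange before the GROUP BY, while joining A and B first avoids that shuffle.

```python
# Hedged sketch: compare two join orders for "join A, B, C, then GROUP BY B.c2".
# Sizes and the cost formula are illustrative assumptions, not real statistics.

SIZE = {"A": 1000, "B": 800, "C": 900}   # assumed relative table sizes
SEL = 0.1                                # assumed join selectivity

def join_cost(left, right):
    # crude model: a distributed join reads both inputs once
    return left + right

def plan_bc_first():
    # (B join C) join A: small intermediate result, but the output is
    # partitioned on the join key c1, so the GROUP BY on B.c2 needs an
    # extra exchange (shuffle) over the final result.
    bc = SEL * (SIZE["B"] + SIZE["C"])
    final = SEL * (bc + SIZE["A"])
    exchange_for_group_by = final
    return join_cost(SIZE["B"], SIZE["C"]) + join_cost(bc, SIZE["A"]) + exchange_for_group_by

def plan_ab_first():
    # (A join B) join C, arranged so the second join's output is already
    # partitioned on B.c2: no extra exchange before the GROUP BY.
    ab = SEL * (SIZE["A"] + SIZE["B"])
    return join_cost(SIZE["A"], SIZE["B"]) + join_cost(ab, SIZE["C"])

print("B-C first:", plan_bc_first())   # 2987.0
print("A-B first:", plan_ab_first())   # 2880.0  (cheaper overall)
```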



A cost-based optimizer takes a different approach. Through the Volcano model it first expands the query into multiple equivalent executable plans: in this example, A can be joined with B first and then with C, or B can be joined with C first and then with A. Because the second plan contains one more exchange, and the cost-based optimizer evaluates every alternative against a cost model, the calculation shows that the first plan has the lower cost, so that plan is selected for execution. The cost-based optimizer in MaxCompute also includes many cost models specific to distributed scenarios and takes Non-SQL into account: many scenarios are internet applications that need extensive Non-SQL support, so users write custom functions, and the optimizer helps optimize queries that combine Non-SQL with relational data, along with further optimizations for various distributed scenarios. This is work a cost-based optimizer does that a single-machine optimizer does not.



Next, a word about the Volcano model. The Volcano model is the engine that drives the cost-based approach, and it was originally proposed for single-machine systems. There are also rules in the Volcano model, but unlike the rules in a rule-based optimizer, these rules behave more like transformation functions. The Volcano model first organizes the logical execution plan into groups; to optimize a group it explores the local expressions in that group and applies transformations according to the rules, and these transformations are algebraically equivalent in principle. Each equivalent transformation does not replace the original logical plan tree; instead it splits off a new tree from the original, so in the end there are many equivalent execution plan trees, and the cost-based optimizer chooses the best one among them. The guiding principle of the Volcano model is that each rule should be as local as possible: the more local and orthogonal the rules, the more thoroughly the space can be explored. For example, if four directions are defined on a plane, any point on the whole two-dimensional plane can be reached through those four directions. Optimization is likewise the problem of choosing the best plan in a space, so the exploration rules applied at each transformation should be orthogonal, allowing a small set of rules to explore the whole space; how to explore the space and choose the best path through it is then left to the engine.
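
A minimal sketch of the "split off a new tree instead of replacing the original" idea, using an assumed memo/group structure (the class and the plan strings below are illustrative, not MaxCompute internals):

```python
# Sketch of a Volcano/Cascades-style group: applying a rule adds an
# equivalent expression to the group instead of replacing the original,
# so one compact structure records many equivalent plan trees.

class Group:
    def __init__(self, first_expr):
        self.equivalent_exprs = [first_expr]

    def add_equivalent(self, expr):
        # keep the original expression and the new alternative side by side
        if expr not in self.equivalent_exprs:
            self.equivalent_exprs.append(expr)

g = Group("Join(Join(A, B), C)")
g.add_equivalent("Join(A, Join(B, C))")   # result of a join-reordering rule
print(g.equivalent_exprs)
# ['Join(Join(A, B), C)', 'Join(A, Join(B, C))']
```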



The previous part is fairly abstract, so here is a further example to deepen understanding of the optimization process. Suppose there is a very complex logical execution plan tree, which is the user's real job, and we extract a small part of it. When optimizing, the engine first checks whether any existing rules match a pattern in the plan. Suppose two nodes in the figure match a pattern: one is a Filter and the other is a Project. In principle a filter should be pushed toward the leaf nodes, that is, the earlier a filter runs the better. Now there is a pattern: if a Filter appears above a Project, the two nodes can be converted into new nodes that swap the order of the Filter and the Project, so the filter is applied earlier; this is the process of applying a rule. Similarly, another node, say an Aggregate, can match other patterns, and the corresponding rules convert it into equivalent node operations, so the nodes of one tree can be reused to maintain multiple trees. In this example two rules are applied: the structure looks like a single stored tree, but it actually describes four equivalent trees. The costs of the four equivalent trees are then computed, and the tree with the lowest cost is chosen as the execution plan. That is the overall cost-based optimization process, but it is clear that when the logical plan tree is large and the rules generate many transformations, the exploration space becomes very large, so many factors have to be considered in the optimization process.
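
The filter/project swap described above can be sketched as a small transformation rule. The node classes and the rule below are illustrative stand-ins, not MaxCompute's operator API:

```python
# Sketch of one rewrite rule: a Filter sitting above a Project can be pushed
# below it when the filter only references columns the Project keeps.

from dataclasses import dataclass

@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object

@dataclass(frozen=True)
class Filter:
    predicate_cols: tuple
    child: object

def push_filter_below_project(node):
    """Return an equivalent tree with the filter applied earlier, or None if the pattern does not match."""
    if isinstance(node, Filter) and isinstance(node.child, Project):
        project = node.child
        if set(node.predicate_cols) <= set(project.columns):
            return Project(project.columns,
                           Filter(node.predicate_cols, project.child))
    return None

tree = Filter(("a",), Project(("a", "b"), "Scan(T)"))
print(push_filter_below_project(tree))
# Project(columns=('a', 'b'), child=Filter(predicate_cols=('a',), child='Scan(T)'))
```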

Next is the general algorithm of the optimization engine. The figure below shows a simplified version of the algorithm; many factors that must be considered in an actual implementation are not shown.



First, all logical nodes in a logical execution plan are registered, and as they are registered the rules are matched against the existing logical patterns; successfully matched rules are pushed into a rule queue. The engine then loops, popping rules from the queue and actually applying them. There are two conditions around applying a rule. One is that the application produces an equivalent tree, i.e. another tree can be split off from part of the current tree, and the split-off tree may in turn match other patterns. Once all rules in the local scope have been matched, the cost can be computed. If computing the cost shows that the current plan is already optimal, optimization of that part can stop; if the current plan is still considered non-optimal, its cost is recorded and the other parts of the tree continue to be optimized, until the best plan is found.
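
The loop just described can be summarized in a few lines of Python. This is a deliberately simplified sketch (no memo groups, no pruning; the names and the toy rule are assumptions), not the real engine:

```python
# Simplified sketch of the optimization loop: register plans, queue matching
# rules, apply them to generate equivalent plans, and pick the cheapest.

from collections import deque

class Rule:
    def __init__(self, matches, apply):
        self.matches, self.apply = matches, apply

def optimize(initial_plans, rules, cost_of):
    plans = set(initial_plans)                        # registered (sub)plans
    queue = deque((r, p) for p in plans for r in rules if r.matches(p))
    while queue:
        rule, plan = queue.popleft()
        new_plan = rule.apply(plan)                   # equivalent alternative
        if new_plan is not None and new_plan not in plans:
            plans.add(new_plan)                       # keep the original too
            queue.extend((r, new_plan) for r in rules if r.matches(new_plan))
    return min(plans, key=cost_of)                    # cheapest equivalent plan

# Toy usage: one reordering rule over string-encoded plans with given costs.
costs = {"Join(Join(A,B),C)": 10, "Join(A,Join(B,C))": 7}
reorder = Rule(matches=lambda p: p == "Join(Join(A,B),C)",
               apply=lambda p: "Join(A,Join(B,C))")
print(optimize(["Join(Join(A,B),C)"], [reorder], cost_of=costs.get))
# Join(A,Join(B,C))
```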

Examples of optimization problems in distributed systems

The following examples show optimization problems in distributed systems that differ from those in single-machine systems.



Example 1 is quite simple: a join of two tables, where T1 is already partitioned by a and b, T2 is already partitioned by a, and the join condition is T1.a = T2.a. One method is to re-partition T1 by a (since T1 is partitioned by a and b but the join condition is only on a) and then join it with T2. However, if T1 is very large, much larger than T2, we do not want to re-partition T1. An alternative plan is to broadcast T2 as a whole, sending all of T2's data to every shard of T1; because the join is an inner join on a, this choice is valid, and it avoids re-shuffling the large table. In this scenario the key is to perceive the join condition. Neither of the two plans is absolutely optimal; which one is better depends on the size of the data, the volume of T2, and the distribution of T1's shards. This problem has been discussed in many papers, for example at SIGMOD 2012, and will not be detailed here.
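
A back-of-the-envelope sketch of that decision (the cost formulas and numbers below are assumptions for illustration, not MaxCompute's actual estimator):

```python
# Choosing between re-partitioning T1 on column a and broadcasting T2 to
# every shard of T1, based purely on how much data each option moves.

def choose_join_strategy(t1_bytes, t2_bytes, t1_shards):
    shuffle_cost = t1_bytes                  # re-partition all of T1 on a
    broadcast_cost = t2_bytes * t1_shards    # ship all of T2 to every T1 shard
    return "broadcast T2" if broadcast_cost < shuffle_cost else "re-partition T1"

# A 10 TB T1 spread over 2000 shards, joined with a 64 MB T2:
print(choose_join_strategy(t1_bytes=10 * 2**40, t2_bytes=64 * 2**20, t1_shards=2000))
# broadcast T2  (128 GB of broadcast traffic beats re-shuffling 10 TB)
```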



Here is another example of a distributed optimization problem. As shown in the figure, T1 and T2 are again joined on a; after the join there is a condition T1.a > 20; then a projection computes a new column b; and finally the result should be ordered by b. Both T1 and T2 are range-partitioned rather than hash-partitioned, and because a global sort has already been performed, the range partition boundaries of the two tables can be used for the join without re-shuffling the data. For example, we already know which partitions contain values greater than 20, so only the corresponding data needs to be read according to the chosen boundaries, and data shuffling can largely be avoided. After the join, a user-defined function produces the values that the ORDER BY b operates on. If foo() is a monotonically increasing function, the fact that the data is already range-partitioned can be exploited: after the join and foo(), the order of b is still preserved, so the ORDER BY b can be executed directly without introducing an exchange. This is a distributed query optimization: if the optimizer understands how the data is sharded and distributed, and also understands the user-defined functions and how they interact with the optimizer, distributed queries can be optimized. In practice, the user's annotations tell the optimizer what characteristics a user-defined function has and which data properties it preserves.
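
A small sketch of the reasoning about foo(): if the optimizer knows, for example through an annotation, that the function is monotonically increasing, an ordering established by the range partitioning survives it, so no exchange is needed before the ORDER BY. The decorator and check below are illustrative assumptions:

```python
# Sketch: a monotonic UDF preserves the global range order of its input,
# so ORDER BY on its output can skip the usual re-sort/exchange.

def monotonic(fn):
    fn.is_monotonic = True        # assumed annotation mechanism
    return fn

@monotonic
def foo(x):
    return 2 * x + 1              # an example of a monotonically increasing function

def needs_exchange_for_order_by(input_globally_range_sorted, fn):
    # If the input is already globally sorted by its range partitioning and the
    # function preserves order, the output is already sorted by fn(column).
    return not (input_globally_range_sorted and getattr(fn, "is_monotonic", False))

print(needs_exchange_for_order_by(True, foo))    # False: the exchange can be skipped
print(needs_exchange_for_order_by(True, abs))    # True: unannotated function, must re-sort
```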

User-defined functions (UDFs)



Distributed systems, and Non-SQL workloads in particular, require large numbers of user-defined functions for extensibility, because many processing steps are not simple joins and aggregates; many special-purpose computations must be modeled as UDFs. Once UDFs are present, the optimizer usually cannot understand them, so the scope of optimization is greatly limited. For example, the yellow node in the middle of the figure above contains a user-defined function; if the system does not know what that function does, the plan may be broken into three smaller optimizable fragments, each optimized on its own. If the optimizer can understand what the UDF is doing, it can optimize through the UDF and achieve a much wider scope of optimization. So which properties of a UDF help the optimizer see through it? Whether the UDF is a row-wise operation, i.e. processes data row by row with no cross-row state; whether it is a monotonic function; whether some columns pass through unchanged, so that data sharding or ordering on them can be preserved; and cost-related information such as its selectivity and how much output data it produces. With this information the optimizer gains a larger optimization space, can optimize more flexibly, and can help the cost model pick a better plan. This is also part of the work Alibaba is currently doing on the MaxCompute optimizer.
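
As a sketch of what such UDF metadata might look like (the names and fields below are assumptions for illustration, not MaxCompute's annotation API):

```python
# Recording UDF properties so the optimizer can reason through the UDF
# instead of treating it as an opaque black box.

from dataclasses import dataclass, field

@dataclass
class UdfProperties:
    row_wise: bool = False                 # one row at a time, no cross-row state
    monotonic: bool = False                # preserves ordering of its input column
    pass_through: set = field(default_factory=set)  # columns emitted unchanged
    selectivity: float = 1.0               # estimated fraction of rows kept

def preserves_partitioning(props: UdfProperties, partition_cols: set) -> bool:
    # Sharding on columns the UDF passes through untouched remains valid after it.
    return props.row_wise and partition_cols <= props.pass_through

props = UdfProperties(row_wise=True, pass_through={"a", "b"}, selectivity=0.3)
print(preserves_partitioning(props, {"a"}))    # True: no re-shuffle needed downstream
```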

Optimization rules

MaxCompute's cost-based optimizer implements the many optimizations shown in the figure below, which will not be described one by one here. As the figure shows, many optimizations can be applied to a query. All of them are expressed as operators on the optimization engine: these operators keep transforming the plan graph, producing many equivalent trees, and the engine then selects the best plan through the cost model.



Cost model

What is the cost model? Each operator's cost model only needs to describe that operator's local cost, for example the cost of its inputs and the cost the join itself adds, rather than the whole plan; the cost of the overall plan is obtained by the engine accumulating the local costs. A good cost model strives to reflect the underlying physical execution, but it does not need to match reality exactly: its ultimate purpose is to distinguish better plans from worse ones, so it only needs to rank plans correctly, and the absolute cost values need not carry any particular physical meaning. The cost models of traditional databases were designed a long time ago, yet even though the hardware has changed, as long as the architecture is still von Neumann, those cost models can still be used to select the better plan.
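
A minimal sketch of "local costs, accumulated by the engine"; the operator cost formulas here are invented for illustration:

```python
# Each operator reports only its local cost; the engine sums local costs to
# compare whole plans and keeps the cheaper one.

def scan_cost(rows):
    return rows

def exchange_cost(rows):
    return 2 * rows              # write plus read of a shuffle

def join_cost(left_rows, right_rows):
    return left_rows + right_rows

def plan_cost(operators):
    """operators: list of (cost_function, args) pairs describing one physical plan."""
    return sum(fn(*args) for fn, args in operators)

plan_with_exchange = [(scan_cost, (1_000_000,)),
                      (exchange_cost, (1_000_000,)),
                      (join_cost, (1_000_000, 50_000))]
plan_broadcast = [(scan_cost, (1_000_000,)),
                  (join_cost, (1_000_000, 50_000))]    # broadcast join, no exchange

print(plan_cost(plan_with_exchange), plan_cost(plan_broadcast))
# 4050000 2050000 -> the engine would pick the broadcast plan
```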



There are many other factors the optimizer has to consider. For example, it must perform equivalent transformations according to the rules and finally select the best plan according to the cost model. As the logical plan grows, enumerating all possible plans takes a great deal of time. On MaxCompute in particular, we want the logical execution plan to be as large as possible, because that gives the optimization engine more room, but it also means that when enumerating plans, some of the enumerated plans are unnecessary and may already be clearly suboptimal. How to prune effectively and avoid exploring unnecessary parts of the space is therefore something a good optimizer must consider. In addition, when choosing where to explore, it is usually better to spend the time on the regions most likely to contain the optimal plan, because the truly optimal plan cannot be found within NP-hard time; the goal is to select a good execution plan within a limited time. So in the field of optimization, the aim is not necessarily to find the very best plan but to avoid the worst ones, since optimization always runs under time constraints.
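
The pruning idea can be sketched in a branch-and-bound style: stop costing an alternative as soon as its running total already exceeds the best complete plan seen so far. This is an illustrative sketch, not MaxCompute's actual search strategy:

```python
# Prune alternatives whose partial cost already exceeds the best known plan.

def cheapest_plan(alternatives):
    """alternatives: list of plans, each given as a list of per-operator costs."""
    best_cost, best_plan = float("inf"), None
    for plan in alternatives:
        running = 0
        for op_cost in plan:
            running += op_cost
            if running >= best_cost:      # cannot beat the best so far: prune
                break
        else:
            best_cost, best_plan = running, plan
    return best_cost, best_plan

print(cheapest_plan([[5, 5, 5], [3, 3, 3], [2, 10, 10]]))
# (9, [3, 3, 3])  -> the third plan is abandoned after its second operator
```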

Why is the cost-based optimizer becoming more and more important for the MaxCompute platform?


This is because Alibaba wants to go beyond Hive-style query statements and offer more complex scripts and stored procedures. As shown in the figure above, with variable assignment and if-else preprocessing, much more complex query flows and stored procedures can be written. A rule-based optimizer, being greedy, drifts further and further from the global optimum as plans grow, and is very likely to miss the globally optimal solution. The complexity of the logical plan enlarges the space that can be optimized, but it also raises the bar for the optimizer, so a better cost-based optimizer is needed to help choose a better execution plan. In new scenarios such as distributed execution and Non-SQL, a cost-based optimizer is used differently from a traditional single-machine optimizer, so a deeper understanding of data, computation, and users is required to make the cost-based optimizer smarter.

Understanding Data


So what does it mean to understand data? In terms of data formats, understanding data means understanding richer data indexes and heterogeneous data types, including structured, unstructured, and semi-structured data. Data in big data scenarios often follows power-law distributions, and there are tables with millions of sparse columns, so good optimization must be achieved in such scenarios too. Understanding data also means understanding rich data-sharding methods: in distributed scenarios alone, data can be sharded by range, hash, or direct hash, stored as column storage or column grouping, and partitioned hierarchically (hierarchy partition). Finally, it means understanding complete data statistics and runtime data, such as histograms, distinct values, and data volume.
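
To illustrate why those statistics matter, here is a sketch of how a histogram and a distinct-value count feed selectivity estimates; the formulas are standard textbook heuristics and only an assumption about how such statistics could be used:

```python
# Simple column statistics driving selectivity estimates for the cost model.

from dataclasses import dataclass

@dataclass
class ColumnStats:
    row_count: int
    distinct_values: int
    histogram: dict            # bucket upper bound -> number of rows in bucket

def eq_selectivity(stats):
    # col = constant: assume rows are spread uniformly over the distinct values
    return 1.0 / max(stats.distinct_values, 1)

def range_selectivity(stats, upper):
    # col <= upper: add up the histogram buckets that fall inside the range
    covered = sum(rows for bound, rows in stats.histogram.items() if bound <= upper)
    return covered / stats.row_count

s = ColumnStats(row_count=1_000_000, distinct_values=50_000,
                histogram={10: 200_000, 20: 300_000, 30: 500_000})
print(eq_selectivity(s), range_selectivity(s, 20))   # 2e-05 0.5
```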

Understanding operations


In terms of understanding operations, the optimizer needs to understand user-defined functions better and interact with them, and users should be able to expose, through annotations, the data properties their operations preserve, so that global optimization becomes possible. There is also more optimization to do at runtime: for example, when a job reaches a certain intermediate stage, the data volume can be measured and the degree of parallelism chosen accordingly, and the optimization strategy on the network topology can be chosen according to where the data is located. The optimizer can also balance latency, scale, performance, cost, and reliability, and use network shuffling to support in-memory computing and stream computing.

Understanding users


From the perspective of understanding users, the optimizer needs to understand user scenarios and, in a multi-tenant environment, the different requirements users have for scale, performance, latency, and cost, and then choose the best solution under those constraints. In terms of ecosystem, the optimizer is the core optimization engine: we hope to be more open at the language level, connect with more languages and ecosystems, and provide a powerful IDE that gives developers a complete development experience. Finally, we hope to offer multiple computing modes on one unified platform, so that the optimizer truly becomes the brain of computing.

Original link: https://yq.aliyun.com/articles/72240?spm=a2c41.11181499.0.0
