Big Data: There Is Predicate Pushdown in Joins Too!?

This article is published by NetEase Cloud.

 

In the previous article, I briefly introduced the background of joins in the big data field and the commonly used algorithms: broadcast hash join, shuffle hash join, and sort merge join, along with the core application scenario of each. The key point bears repeating: a join between a large table and a small table uses broadcast hash join; once the small table grows a bit too large to broadcast, shuffle hash join is chosen instead; and if both tables are large, sort merge join is the undisputed choice.

 

Okay, here comes the question. So it is said, but which algorithm to choose is ultimately up to the SQL execution engine. By the logic above, the engine must know the sizes of the two tables participating in the join in order to pick the best algorithm. So, dare I ask, how does it know those sizes? Is table size measured by physical size, by record count, or both? This is actually a whole discipline of its own: Cost-Based Optimization (CBO for short), which not only explains join algorithm selection but, more importantly, also determines the join order in multi-table join scenarios.

 

Looking forward to CBO? Well, let's leave that hole dug for a future article. So what are we talking about today? Join algorithm selection and join order selection do have a great impact on join performance, but there is another factor that is just as crucial: optimizing the join algorithms themselves! Broadcast hash join, shuffle hash join, and sort merge join are all the most basic join algorithms; is there an optimization scheme on top of them? There is indeed, and it is today's protagonist: the Runtime Filter (hereinafter, RF).

 

RF Preliminary Knowledge: Bloom Filter

To put it bluntly, RF uses a bloomfilter to pre-filter the tables participating in the join, reducing the amount of data that actually takes part in it. To explain the whole process in detail below, we first need to cover the bloomfilter data structure (readers already familiar with it can skip ahead). A Bloom filter implements filtering with a bit array; in the initial state, every bit of the array is 0, as shown in the following figure:

Given a set S = {x1, x2, ..., xn}, a Bloom filter uses k independent hash functions to map each element of the set into the range {1, ..., m}. For each element, the mapped number is used as an index into the bit array, and that bit is set to 1. For example, if element x1 is hashed to the number 8, the 8th bit of the bit array is set to 1. In the figure below, the set S contains just two elements, x and y, each mapped by three hash functions; the mapped positions are (0, 3, 6) and (4, 7, 10) respectively, and the corresponding bits are set to 1:

Now, to judge whether some other element is in the set, simply map it with the same three hash functions and check whether any of the corresponding positions holds a 0. If so, the element definitely does not exist in the set; otherwise, it may exist. As shown in the figure below, z is definitely not in the set {x, y}:
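To make the structure concrete, here is a minimal Bloom filter sketch in Python. The bit-array size m, the hash count k, and the salted-MD5 trick for deriving k hashes are all illustrative choices for this article, not what any particular engine actually does:

import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m = m            # number of bits in the array
        self.k = k            # number of hash functions
        self.bits = [0] * m   # bit array, all zeros initially

    def _positions(self, item):
        # Derive k hash positions by salting one digest; illustrative only,
        # real engines use faster hashes and a packed bitmap.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any 0 bit means "definitely not in the set"; all 1s means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("x")
bf.add("y")
print(bf.might_contain("x"))   # True: added elements are always found
print(bf.might_contain("z"))   # False (or, rarely, a false-positive True)

Note the asymmetry: a negative answer is certain, while a positive answer is only "maybe". This is exactly the property RF exploits below.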

RF Algorithm Theory

To better illustrate the whole process, let's walk through the RF algorithm with a concrete SQL example: select item.name, order.* from order, item where order.item_id = item.id and item.category = 'book'. Here order is an order table and item is a product table, joined on the product id field. The SQL retrieves the details of all orders whose product category is books. Assuming there are not many products in the book category, the chosen join algorithm is broadcast hash join. The whole process is shown in the figure below:

 

 

Step 1: Map the join field (item.id) of the item table into a bloomfilter using multiple hash functions (if the bloomfilter is still unfamiliar, see the primer above);

 

Step 2: Broadcast the resulting bloomfilter to all partitions of the order table, ready for filtering;

 

Step 3: Taking Partition 2 as an example, the storage process (such as the DataNode process) reads the join column (order.item_id) of the order table value by value and tests each value against the bloomfilter. If the filter rules a value out, the order cannot involve a book product, and the row is skipped entirely; otherwise, the order may be one we are looking for, and the rest of that row's columns are scanned;

 

Step 4: Send all order data that passed the bloomfilter to the compute process (impalad) via local socket communication;

 

Step 5: Broadcast all the book product data to every partition node and perform the actual hash join against the order data obtained in step 4 to produce the final result.
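Here is a minimal end-to-end sketch of these five steps, reusing the BloomFilter class from the primer above. The in-memory "partitions", the row layout, and the toy data are purely illustrative; a real engine such as Impala performs these steps across separate processes and machines:

# Step 1: build the runtime filter over the small (item) table's join keys.
items = [(1, "Dune"), (3, "SICP")]          # book items: (id, name)
rf = BloomFilter()
for item_id, _name in items:
    rf.add(item_id)

# Steps 2-4: each order partition is filtered on the "storage" side, so
# only surviving rows are shipped to the "compute" side.
order_partitions = [
    [(101, 1), (102, 2)],                   # (order_id, item_id)
    [(103, 3), (104, 5)],
]
surviving = [row for partition in order_partitions
             for row in partition
             if rf.might_contain(row[1])]

# Step 5: the actual broadcast hash join over the survivors; the equality
# check also drops any bloomfilter false positives.
hash_table = dict(items)
result = [(order_id, hash_table[item_id])
          for order_id, item_id in surviving
          if item_id in hash_table]
print(result)                               # [(101, 'Dune'), (103, 'SICP')]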

 

RF Algorithm Analysis

The SQL example above briefly demonstrated how the RF algorithm operates inside broadcast hash join. Based on that flow, here is an analysis of the algorithm at the theoretical level:

 

RF essence: RF is predicate pushdown where the pushed-down predicate is a bloomfilter; the storage layer filters data through the bloomfilter, optimizing the join in three ways. First, if many records can be skipped, the amount of data IO is reduced. This point needs explanation, because many readers will ask: since the data has to be scanned in order to run it through the BloomFilter, how can the amount of IO go down? Note an important fact: these tables mostly use columnar storage, where each column is stored independently. The scan-and-filter step only needs to read the join column (rather than all columns); for every row the filter eliminates, none of that row's other columns needs to be read at all, which is what reduces IO. Second, the overhead of sending data from the storage layer to the compute layer over sockets (or even TCP) is reduced. Third, the cost of executing the final hash join is reduced.
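A small sketch of why columnar storage makes this cheap, assuming a toy column layout and reusing the rf filter from the sketch above (real formats like Parquet add encodings, row groups, and lazy materialization on top of this idea):

# Columns are stored independently; reading one column touches no others.
columns = {
    "item_id": [1, 2, 3, 5],
    "price":   [9.9, 4.5, 30.0, 7.0],
    "city":    ["bj", "sh", "hz", "gz"],
}

# Pass 1: scan ONLY the join column and test each value against the filter.
keep = [i for i, v in enumerate(columns["item_id"]) if rf.might_contain(v)]

# Pass 2: materialize the remaining columns just for the surviving rows.
rows = [(columns["item_id"][i], columns["price"][i], columns["city"][i])
        for i in keep]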

 

RF cost: Compared with a broadcast hash join without RF, RF mainly adds three overheads: generating the bloomfilter, broadcasting it, and filtering the large table against it. Under normal circumstances, when the small table is indeed small, these steps are cheap and can basically be ignored.
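To see why the filter itself is cheap, here is the standard Bloom filter sizing formula worked through in Python, for an assumed small table of one million join keys and a 1% false-positive target (illustrative numbers, not taken from the benchmark below):

import math

n = 1_000_000          # number of keys in the small table (assumed)
p = 0.01               # target false-positive rate (assumed)
m = -n * math.log(p) / math.log(2) ** 2   # optimal number of bits
k = (m / n) * math.log(2)                 # optimal number of hash functions
print(f"{m / 8 / 2**20:.2f} MiB, {round(k)} hashes")   # ~1.14 MiB, 7 hashes

A filter of roughly a megabyte is trivial to build, broadcast, and probe, which is why the added cost is usually negligible.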

 

RF optimization effect: It depends almost entirely on the bloomfilter's filtering power. If a large amount of data is filtered out, join performance improves dramatically; otherwise, the improvement is limited.

 

RF implementation: Like ordinary predicate pushdown ('=', '>', '<', etc.), RF requires logic in both the compute layer and the storage layer. The compute layer constructs the bloomfilter and pushes it down to the storage layer; the storage layer uses the bloomfilter to filter the specified data.
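One way to picture this division of labor: at the storage layer, a runtime filter is just one more predicate type alongside the usual comparisons. A hypothetical interface, assuming the BloomFilter class above (this is not Impala's or Kudu's actual API):

from typing import Protocol

class ScanPredicate(Protocol):
    def matches(self, value) -> bool: ...

class LessThanPredicate:
    # An ordinary comparison predicate, e.g. pushed down from "col < 10".
    def __init__(self, bound):
        self.bound = bound
    def matches(self, value) -> bool:
        return value < self.bound

class BloomFilterPredicate:
    # Built by the compute layer from the small table, then pushed down.
    def __init__(self, bf):
        self.bf = bf
    def matches(self, value) -> bool:
        return self.bf.might_contain(value)

def scan_column(values, predicate: ScanPredicate):
    # The storage process applies whichever predicate was pushed down.
    return [i for i, v in enumerate(values) if predicate.matches(v)]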

 

RF Effect Verification

In fact, RF's optimization effect first surfaced when my colleague He ran a benchmark comparing impala on parquet with impala on kudu. In that test, impala on parquet held an obvious advantage over impala on kudu, at least a 10x improvement by eyeball: the same SQL engine on different storage engines performed wildly differently! To pin down the cause, my colleague used impala's execution-plan analysis tool on both plans and found that the former used RF while the latter did not (there may of course be other factors, but RF is certainly one of the reasons).

 

Let's briefly replay the test. The benchmark used TPC-DS at a 1TB data scale. This article takes one typical SQL from the test, Q40, as an example to replay and demonstrate RF's magical effect. The figure below shows Q40's comparative performance: at a glance, RF alone brings a 40x performance improvement. Forty times! How is that achieved?

Let's take a brief look at Q40's SQL statement, shown below. It looks complicated, but at its core it involves joining catalog_sales with three tables (catalog_sales join date_dim, catalog_sales join warehouse, and catalog_sales join item):

select
    w_state, i_item_id,
    sum(case when (cast(d_date as date) <
                   cast('1998-04-08' as date))
             then cs_sales_price -
                  coalesce(cr_refunded_cash, 0)
             else 0 end) as sales_before,
    sum(case when (cast(d_date as date) >=
                   cast('1998-04-08' as date))
             then cs_sales_price -
                  coalesce(cr_refunded_cash, 0)
             else 0 end) as sales_after
from
    catalog_sales left outer join catalog_returns
        on (catalog_sales.cs_order_number =
                catalog_returns.cr_order_number
            and catalog_sales.cs_item_sk =
                catalog_returns.cr_item_sk),
    warehouse, item, date_dim
where
    i_current_price between 0.99 and 1.49
    and item.i_item_sk = catalog_sales.cs_item_sk
    and catalog_sales.cs_warehouse_sk =
            warehouse.w_warehouse_sk
    and catalog_sales.cs_sold_date_sk =
            date_dim.d_date_sk
    and date_dim.d_date between
            '1998-03-09' and '1998-05-08'
group by w_state, i_item_id
order by w_state, i_item_id
limit 100;

 

This is a typical star schema: catalog_sales is the fact table and the other tables are dimension tables. This analysis focuses on the catalog_sales join item dimension join. Because both scenarios in the comparison use impala as the SQL engine, the execution plans are essentially identical. With that in mind, let's look at the main time-consuming stages, in order, of the catalog_sales join item operation on a single execution node, listing only the important ones (the join algorithm in Q40 is shuffle hash join, slightly different from the broadcast hash join example above, but this does not affect the conclusion):

 

 

Analyzing the execution plans of the two scenarios basically confirms the theoretical analysis above:

 

1. It confirms that RF filters out most of the large table's data, leaving only a small amount to participate in the final hash join. Looking at the large-table scan results in the second row: without RF, the scan returns 70 million+ rows, while after RF filtering only about 30,000 rows qualify. 30,000 versus 70 million: the optimization effect speaks for itself.

 

2. With RF filtering, the network time to load the small surviving dataset from the storage process into the compute process's memory drops dramatically. See the third row, "data loaded into compute-process memory": 15s without RF versus only 11ms with it. The 15s breaks into two parts: data serialization, about two thirds (roughly 10s), and data transmission over RPC, about one third (roughly 5s).

 

3. Finally, with far fewer rows participating after RF filtering, the hash join itself takes 19s without RF versus about 21ms with it. Most of that time is the probe phase over the large table: about 17s without RF, only 6ms with it.

 

So This Is the Famous Predicate Pushdown?

 

To tell the truth, when I first encountered RF I thought it was a genuine artifact of wizardry, and my admiration was beyond words. But after a period of exploration and digestion, by the time I finished writing this article, at this very moment, I suddenly felt it is not so inscrutable after all. To put it bluntly, it is just predicate pushdown; the only difference is that the predicate here is a little unusual: it's a bloomfilter.

 

Since predicate pushdown has come up, let's dwell on it a little. I used to hear the term everywhere, yet always felt my grasp of it was hazy and never quite clear. After this baptism by RF, I'm convinced my understanding has deepened, so let me share it here. Personally, I think predicate pushdown can be understood at two levels:

 

The first is the logical plan optimization level. Take the SQL statement select * from order, item where item.id = order.item_id and item.category = 'book'. Normally, after parsing, the join operation would execute first and the filter afterwards. With predicate pushdown, the filter can be pushed down to execute before the join; that is, where item.category = 'book' is evaluated before item.id = order.item_id.
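A toy illustration of that plan rewrite, with plans written as nested tuples (purely illustrative; this is not any engine's actual plan representation):

# Before: the filter sits above the join, so the join sees all rows.
plan = ("filter", "item.category = 'book'",
        ("join", "item.id = order.item_id",
            ("scan", "item"),
            ("scan", "order")))

# After pushdown: the filter moves below the join, onto the item branch,
# so the join only ever sees book items.
pushed = ("join", "item.id = order.item_id",
          ("filter", "item.category = 'book'",
              ("scan", "item")),
          ("scan", "order"))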

 

The second is the physical implementation level: predicate pushdown means pushing filter conditions down from the compute process to the storage process for early execution. Note that there are two kinds of processes here: compute processes and storage processes. The idea of separating compute and storage is very common in big data. Typical compute processes are SparkSQL, Hive, and impala, responsible for SQL parsing and optimization and for computing and aggregating data; storage processes such as HDFS (DataNode), Kudu, and HBase are responsible for storing the data. Normally, all data would be loaded from the storage process into the compute process and filtered and computed there. Predicate pushdown instead ships some filter conditions down to the storage process and lets it discard data directly. The benefit is obvious: the earlier the filtering, the smaller the data volume, so serialization overhead, network overhead, and compute overhead all shrink, and performance naturally improves.

 

Writing this far, I suddenly realize I made a serious cognitive error above: the RF mechanism is not merely a simple predicate pushdown; its essence is to introduce an important new kind of predicate, the bloomfilter. At present, not many systems support RF. As far as I know, only Impala on Parquet supports it; with Impala on Kudu, Impala supports it but Kudu does not; and with SparkSQL on Parquet, the storage side has the support, but the compute engine, SparkSQL, helplessly does not yet.

 

This article introduced an optimization in the same spirit as a semi-join, dug into its details, and shared my own understanding of predicate pushdown along the way. A follow-up will cover cost-based optimization (CBO), so stay tuned!

 

 

Learn about NetEase Cloud:
NetEase Cloud Official Website: https://www.163yun.com/
New User Gift Package: https://www.163yun.com/gift
NetEase Cloud Community: https://sq.163yun.com/

 
