How to Make JOIN Run Faster

JOIN has always been a major difficulty in database performance optimization. Once a query involves several JOINs, performance drops sharply, and the more tables a JOIN involves and the larger they are, the harder it becomes to improve performance.

In fact, the key to making JOINs run faster is to classify them. Once classified, each type of JOIN has characteristics that can be exploited for performance optimization.

JOIN classification

Anyone with SQL development experience knows that the vast majority of JOINs are equi-joins, that is, JOINs whose association condition is an equality. Non-equi-joins are much rarer, and most of them can be converted to equi-joins, so we will only discuss equi-joins here.

Equi-joins fall mainly into two categories: foreign key association and primary key association.

Foreign key association uses a non-primary-key field of one table to associate with the primary key of another table. The former is called the fact table and the latter the dimension table. For example, in the figure below, the order table is the fact table, and the customer, product, and employee tables are dimension tables.

[Figure: the order table (fact table) and its dimension tables: customer, product, and employee]

A foreign key association is a many-to-one relationship. It is asymmetric: the positions of the fact table and the dimension table cannot be interchanged. Note that the primary key mentioned here is the logical primary key, that is, the field (or group of fields) whose values are unique in the table and can therefore identify a single record. A primary key does not have to be formally declared on the database table.

Primary key association uses the primary key of one table to associate with the primary key, or part of the primary key, of another table. Examples are the relationship between the customer table and the VIP customer table, and between the order table and the order details table, shown in the following figure.

[Figure: the customer table and VIP customer table; the order table and order details table]

The customer table and the VIP customer table are associated through their primary keys; we call such tables same-dimension tables. The primary key of the order table is associated with part of the primary key of the order details table. We call the order table the main table and the details table the sub-table.

Same-dimension tables have a one-to-one relationship, and the relationship is symmetric: the two tables have equal status. The main table and sub-table have a one-to-many relationship, which is asymmetric and has a clear direction.

If you look closely, you will find that both types of JOIN involve the primary key. A JOIN that does not involve the primary key produces a many-to-many relationship, which makes no business sense in most cases. In other words, these two categories cover almost all JOINs that make business sense. If we can exploit the fact that a JOIN always involves the primary key, we can optimize both types of JOIN, which means solving most JOIN performance problems.

However, SQL's definition of JOIN does not involve the primary key at all: a JOIN is simply the Cartesian product of two tables, filtered by some condition. This definition is simple and broad enough to describe almost everything. But if JOIN is implemented strictly according to this definition, there is no way to take advantage of the characteristics of the primary key for performance optimization.

SPL changes the definition of JOIN and handles these two types of JOIN separately, using the characteristics of the primary key to reduce the amount of computation and thereby improve performance.
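To make that definition concrete, here is a minimal Python sketch of a JOIN computed literally as a filtered Cartesian product. The function name `cartesian_join` and the sample rows are made up for illustration; real databases do not execute equi-joins this way, but the definition itself gives the engine nothing better to assume.

```python
from itertools import product

def cartesian_join(left, right, condition):
    """JOIN by definition: keep the pairs of the Cartesian product that pass the filter."""
    return [(l, r) for l, r in product(left, right) if condition(l, r)]

# Hypothetical sample data: orders reference employees through eid.
orders = [
    {"oid": 1, "eid": 10},
    {"oid": 2, "eid": 20},
    {"oid": 3, "eid": 10},
]
employees = [
    {"eid": 10, "name": "Ann"},
    {"eid": 20, "name": "Bob"},
]

joined = cartesian_join(orders, employees, lambda o, e: o["eid"] == e["eid"])
pairs = [(o["oid"], e["name"]) for o, e in joined]
```

Nothing in this formulation says that eid is unique in the employee table, so every one of the len(orders) * len(employees) pairs has to be considered.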

Let's take a look at how SPL does it in detail.

Foreign key association

If neither the fact table nor the dimension table is too large, both can be loaded into memory. For this case SPL provides foreign key addressization: first convert each foreign key value in the fact table into the address of the corresponding dimension table record; afterwards, whenever a dimension table field is referenced, it can be fetched directly through the address.

Take the order table and employee table above as an example, and assume both have been read into memory. Foreign key addressization works as follows: for the eid field of a record r in the order table, find the employee table record with that eid value, get its memory address a, and replace the eid value of r with a. Once this conversion has been done for every record in the order table, foreign key addressization is complete. From then on, when order record r needs an employee table field, it uses the address a stored in its eid field to fetch the employee record directly. The field is obtained in constant time, without any lookup in the employee table.

The fact table and dimension table can be read into memory when the system starts, and foreign key addressization can be done once at that time; this is called pre-association. In subsequent association calculations, the addresses stored in the foreign key field of the fact table are used directly to fetch dimension table records, giving a high-performance JOIN.
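The mechanism can be sketched in a few lines of Python. This is an illustration only, not SPL code: Python object references stand in for memory addresses, the helper name `addressize` and the sample data are hypothetical, and every foreign key value is assumed to have a matching dimension record.

```python
def addressize(fact_rows, fk_field, dim_rows, pk_field):
    """Replace each foreign key value with a reference to the dimension record."""
    # Lookup structure used only during the one-time conversion.
    by_key = {d[pk_field]: d for d in dim_rows}
    for r in fact_rows:
        r[fk_field] = by_key[r[fk_field]]  # store the record reference, not the key

employees = [
    {"eid": 10, "name": "Ann", "dept": "Sales"},
    {"eid": 20, "name": "Bob", "dept": "R&D"},
]
orders = [
    {"oid": 1, "eid": 10, "amount": 120.0},
    {"oid": 2, "eid": 20, "amount": 80.0},
    {"oid": 3, "eid": 10, "amount": 50.0},
]

addressize(orders, "eid", employees, "eid")  # done once, up front

# Later calculations follow the reference; no search by key value is needed.
sales_total = sum(o["amount"] for o in orders if o["eid"]["dept"] == "Sales")
```

The conversion itself still has to look up each key once. The gain comes afterwards, when every later calculation reuses the stored references.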

For the detailed principles of foreign key addressization and pre-association, please refer to: [Performance optimization] 6.1 [Foreign key association] Foreign key addressization

For in-memory joins, SQL databases usually use a hash algorithm, which has to calculate hash values and compare keys, so it is slower than reading directly through an address.

The reason why SPL can realize foreign key addressization is to take advantage of the feature that the associated field of the dimension table is the primary key. In the above example, the associated field eid is the primary key of the employee table, which is unique. Each eid in the order table only corresponds to one employee record, so each eid can be converted into the address of the employee record that it uniquely corresponds to.

However, SQL's definition of JOIN includes no convention about primary keys, so the database cannot assume that each foreign key value in the fact table is associated with exactly one dimension table record; it may be associated with several. An eid value in the order table therefore cannot be guaranteed to correspond to a single employee record, and foreign key addressization is impossible. Moreover, SQL has no data type for record addresses. As a result, hash values have to be calculated and compared every time the association is performed.

When only two tables are JOINed, the difference between foreign key addressization and hash association is not very obvious. JOIN is not the ultimate goal; many other operations follow it, and the JOIN itself accounts for a relatively small share of the time. But a fact table often has several dimension tables, and dimension tables may themselves have several layers. For example, orders are associated with products, products with suppliers, suppliers with cities, cities with countries, and so on. The more tables are associated, the more obvious the performance advantage of foreign key addressization becomes.

The following test compares the performance of SPL and Oracle for different numbers of associated tables. When many tables are involved, the advantage of foreign key addressization is quite obvious:
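The multi-layer case can be illustrated with a small Python sketch in which each layer's foreign key is replaced by an object reference. The helper `link`, the field names, and the data are all hypothetical; the point is only that reaching the outermost layer becomes a chain of direct reference hops rather than one hash probe per layer per record.

```python
def link(rows, fk_field, target_rows, pk_field):
    """Replace the foreign key values in rows with references into target_rows."""
    by_key = {t[pk_field]: t for t in target_rows}
    for r in rows:
        r[fk_field] = by_key[r[fk_field]]

# Hypothetical layered dimensions: order -> product -> supplier -> city -> country.
countries = [{"nid": 1, "name": "France"}, {"nid": 2, "name": "Japan"}]
cities = [{"cid": 1, "name": "Paris", "nid": 1}, {"cid": 2, "name": "Osaka", "nid": 2}]
suppliers = [{"sid": 1, "cid": 1}, {"sid": 2, "cid": 2}]
products = [{"pid": 1, "sid": 1}, {"pid": 2, "sid": 2}]
orders = [
    {"oid": 1, "pid": 2, "amount": 30.0},
    {"oid": 2, "pid": 1, "amount": 70.0},
    {"oid": 3, "pid": 2, "amount": 5.0},
]

# One-time linking of every layer.
link(cities, "nid", countries, "nid")
link(suppliers, "cid", cities, "cid")
link(products, "sid", suppliers, "sid")
link(orders, "pid", products, "pid")

# Total order amount per supplier country: four reference hops per order.
totals = {}
for o in orders:
    country = o["pid"]["sid"]["cid"]["nid"]["name"]
    totals[country] = totals.get(country, 0.0) + o["amount"]
```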

[Figure: test results comparing SPL and Oracle as the number of associated tables grows]

For details of the test, please refer to: Performance Optimization Tips: Pre-Association.

When only the dimension table fits in memory and the fact table is so large that it has to stay in external storage, SPL provides foreign key serialization: the foreign key values in the fact table are converted in advance into the sequence numbers of the corresponding dimension table records. During the association calculation, the converted fact table is read in batches, and the dimension table records are fetched by sequence number.

Take the order table and product table above as an example, and assume the product table has been loaded into memory while the order table is kept in external storage. Foreign key serialization works as follows: read a batch of order data, and suppose that the pid of a record r corresponds to the i-th record of the in-memory product table; replace the pid value of r with i. Repeat until all order records have been converted. Later, during the association calculation, the order data is read from external storage in batches, and for each record r the corresponding record is fetched from the in-memory product table directly by position using the pid value, with no search.
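A rough Python sketch of the two phases follows. It is not SPL code: a list of batches stands in for the order table in external storage, positions are 0-based as is usual in Python, and the names `serialize_fk` and `join_by_position` are invented for this example.

```python
def serialize_fk(batches, fk_field, dim_rows, pk_field):
    """One-time pass: rewrite foreign key values as positions in dim_rows."""
    position = {d[pk_field]: i for i, d in enumerate(dim_rows)}
    for batch in batches:
        yield [{**r, fk_field: position[r[fk_field]]} for r in batch]

def join_by_position(batches, fk_field, dim_rows):
    """Association: fetch the dimension record by position, with no search."""
    for batch in batches:
        for r in batch:
            yield r, dim_rows[r[fk_field]]

# Hypothetical data: the product table is in memory, orders arrive in batches.
products = [{"pid": "P7", "price": 2.0}, {"pid": "P3", "price": 5.0}]
raw_batches = [
    [{"oid": 1, "pid": "P3", "qty": 2}, {"oid": 2, "pid": "P7", "qty": 1}],
    [{"oid": 3, "pid": "P3", "qty": 4}],
]

# Phase 1 (done once): convert and store the fact table.
stored = list(serialize_fk(raw_batches, "pid", products, "pid"))

# Phase 2 (every query): read batches and index straight into the product list.
revenue = sum(o["qty"] * p["price"] for o, p in join_by_position(stored, "pid", products))
```

The positions are only valid as long as the order of the in-memory product list does not change, which is why the dimension table has to be loaded in the same order each time.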

For a more detailed introduction to the principle of foreign key serialization, refer to: [Performance Optimization] 6.3 [Foreign Key Association] Foreign key serialization.

A database usually reads the small table into memory and then reads the large table in batches, joining them in memory with a hash algorithm that has to calculate hash values and compare keys. SPL instead locates records directly by sequence number without any comparison, which gives it a clear performance advantage. Converting the foreign key fields of the fact table into sequence numbers in advance does have a cost, but this pre-computation is done only once and can be reused by many foreign key associations.

Foreign key serialization in SPL also relies on the associated field of the dimension table being the primary key. As mentioned earlier, SQL's definition of JOIN has no primary key convention, so this characteristic cannot be used to serialize foreign keys. In addition, SQL is based on unordered sets. Even if we serialize the foreign key in advance, the database cannot take advantage of it: positioning by sequence number does not work on an unordered set, and the fastest available method is an index lookup. The database also does not know that the foreign key has been serialized, so it still calculates hash values and compares them.

In the following test, SPL and Oracle are compared at different degrees of parallelism on the association of a large fact table with small dimension tables. SPL runs 3 to 8 times faster than Oracle. The results are shown in the following figure:

[Figure: test results comparing SPL and Oracle at different degrees of parallelism]

For more detailed information about this test, please refer to: Performance Optimization Tips: Foreign Key Serialization.

If the dimension table is large and has to stay in external storage while the fact table is small enough to fit in memory, SPL provides a large dimension table lookup mechanism. If both the dimension table and the fact table are large, SPL uses a one-sided partitioning algorithm. For cases where the dimension table is filtered before being associated, SPL provides methods such as index reuse and aligned sequences.

When the data volume is so large that distributed computing is required, SPL duplicates a small dimension table so that every cluster node holds a copy; for a large dimension table, it uses the cluster dimension table method to ensure random access. Both methods effectively avoid the shuffle. In contrast, the SQL system cannot distinguish dimension tables. Its hash splitting method has to shuffle both tables, so far more data is transmitted over the network.

Primary key association

The tables involved in a primary key association are generally large and need to be kept in external storage. For this purpose SPL provides the ordered merge method: the tables are stored in external storage ordered by primary key in advance, and during association the data is read in order and merged.

Take the inner join of the customer table and the VIP customer table as an example, and assume both tables are stored in external storage ordered by the primary key cid. During association, records are read from the cursors of the two tables and their cid values are compared one by one. If the cid values are equal, the two records are merged into one record of the result cursor and returned. If they are not equal, the cursor with the smaller cid reads its next record and the comparison continues. This repeats until either table is exhausted; the returned cursor is the result of the JOIN.

For the association of two large tables, databases usually use a hash partitioning algorithm, whose complexity is multiplicative. The complexity of the ordered merge algorithm is additive, so it performs much better. Moreover, when a database processes large data in external storage, hash partitioning has to write and read back cache files. The ordered merge algorithm only needs to traverse the two tables once each, without any external cache, which greatly reduces the amount of I/O and gives it a large performance advantage.

Sorting by primary key in advance is costly, but it only has to be done once. From then on, the merge algorithm can always be used for the JOIN, with a large performance gain. SPL also provides a way to keep the data ordered overall when new data is appended.
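The procedure maps naturally onto a Python generator. The sketch below is an illustration under stated assumptions rather than SPL's implementation: both inputs are sorted by a key that is unique on each side (the one-to-one case), plain lists stand in for cursors over external storage, and the name `merge_join` and the sample rows are hypothetical.

```python
def merge_join(left, right, key):
    """Inner join of two inputs sorted by a unique key, in one pass over each."""
    left_it, right_it = iter(left), iter(right)
    l, r = next(left_it, None), next(right_it, None)
    while l is not None and r is not None:
        if l[key] == r[key]:
            yield {**l, **r}              # merge the two records into one
            l, r = next(left_it, None), next(right_it, None)
        elif l[key] < r[key]:
            l = next(left_it, None)       # the side with the smaller key advances
        else:
            r = next(right_it, None)

customers = [
    {"cid": 1, "name": "Ann"},
    {"cid": 2, "name": "Bob"},
    {"cid": 4, "name": "Cy"},
    {"cid": 5, "name": "Di"},
]
vip_customers = [
    {"cid": 2, "level": "gold"},
    {"cid": 3, "level": "silver"},
    {"cid": 5, "level": "gold"},
]

result = list(merge_join(customers, vip_customers, "cid"))
```

Each input is read exactly once and nothing is buffered besides the two current records. A main table and sub-table join would differ slightly: the main table record has to be kept while several sub-table records with the same key are consumed.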

This type of JOIN is characterized by the associated field being the primary key or part of it, and the ordered merge algorithm is designed around that characteristic. Whether the tables are same-dimension tables or a main table and sub-table, the associated field is never anything other than the primary key, so the associated tables can be sorted and stored by primary key without any redundancy. Foreign key association lacks this characteristic and cannot use the ordered merge. The associated field of the fact table is not its primary key, and several foreign key fields may take part in associations; the same fact table cannot be ordered by several fields at once.

SQL's definition of JOIN does not distinguish JOIN types, and does not assume that some JOINs are always for the primary key, so there is no way to use the characteristics of the primary key association from the algorithm level. Moreover, as mentioned earlier, SQL is based on the concept of unordered collections, and the database does not deliberately guarantee the physical ordering of data, so it is difficult to implement an ordered merge algorithm.

Another advantage of the ordered merge algorithm is that it is easy to parallelize by segment. Take the order table and the order details table as an example. If each table is simply divided into 4 segments by number of records, an oid in the 2nd segment of the order table may appear in the 3rd segment of the details table, and such misalignment leads to wrong results. SPL again uses the ordering by the primary key oid and provides a synchronous segmentation mechanism to solve this problem: first divide the ordered order table into 4 segments, then take the oid values of the first and last records of each segment to form 4 intervals, and divide the details table into 4 synchronized segments according to those intervals. The corresponding segments of the two tables are then aligned during parallel computing. Because the details table is also ordered by oid, each segment boundary can be located quickly from the start and end oid values, without reducing the performance of the ordered merge.

For the principles of ordered merge and synchronous segment parallelism, see: SPL Ordered Merge Association.

Parallelism is harder to achieve with the traditional hash partitioning technique. When multiple threads partition by hash, they need to write data to the same partition at the same time, causing contention for shared resources. The next step, associating one group of partitions, consumes a lot of memory, so a high degree of parallelism is not possible.

An actual test of primary key association on two large tables under the same conditions (for details, see Performance Optimization Tips: Orderly Merge) shows SPL to be nearly 3 times faster than Oracle:
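Synchronous segmentation can be sketched in Python with a binary search on the sorted key column. This is a simplified in-memory illustration, not SPL's storage-level mechanism: the function name `sync_segments` and the data are hypothetical, and both tables are assumed to be sorted by oid.

```python
from bisect import bisect_left

def sync_segments(orders, details, n, key="oid"):
    """Split orders into n segments and cut details at the matching key boundaries."""
    size = max(1, -(-len(orders) // n))        # ceiling division
    detail_keys = [d[key] for d in details]    # sorted, so binary search works
    segments = []
    for start in range(0, len(orders), size):
        part = orders[start:start + size]
        nxt = start + size
        # The detail segment starts at this segment's first key and
        # ends just before the first key of the next order segment.
        lo = bisect_left(detail_keys, part[0][key])
        hi = bisect_left(detail_keys, orders[nxt][key]) if nxt < len(orders) else len(details)
        segments.append((part, details[lo:hi]))
    return segments

# Hypothetical data: 8 orders with a varying number of detail lines each.
orders = [{"oid": i} for i in range(1, 9)]
line_counts = [(1, 1), (2, 3), (3, 1), (4, 2), (5, 1), (6, 2), (7, 1), (8, 1)]
details = [{"oid": o, "line": k} for o, count in line_counts for k in range(count)]

segments = sync_segments(orders, details, 4)
sizes = [(len(o), len(d)) for o, d in segments]
```

The detail segments have different lengths, but each one contains exactly the lines that belong to the orders of its partner segment, so the 4 pairs can be merged independently, for example by 4 worker threads.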

[Figure: test results of primary key association on two large tables, SPL versus Oracle]

In addition to the ordered merge, SPL provides other high-performance algorithms that speed up primary key association JOINs. These include the attached table mechanism, which stores multiple tables together, reducing the amount of stored data and in effect completing the association in advance so that no comparison is needed; and the association positioning algorithm, which filters first and then associates, avoiding a full table traversal.

When the data volume keeps growing and a cluster of several servers is required, SPL provides a group table mechanism that distributes the large tables to be associated across the cluster nodes by primary key. Data with the same primary key resides on the same node, so no data has to be transmitted between nodes and no shuffle occurs.
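The idea of co-locating records by primary key can be illustrated with a toy Python sketch. The placement rule (key modulo node count), the function name `distribute`, and the data are assumptions made for this example; the article does not say how SPL actually assigns keys to nodes, and the local join here uses a dictionary only for brevity.

```python
def distribute(rows, key, n_nodes):
    """Place each row on a node chosen only by its primary key value."""
    nodes = [[] for _ in range(n_nodes)]
    for r in rows:
        nodes[r[key] % n_nodes].append(r)   # integer keys assumed
    return nodes

# Hypothetical data: 6 orders with 2 detail lines each, spread over 3 nodes.
orders = [{"oid": i, "cust": f"C{i}"} for i in range(1, 7)]
details = [{"oid": i, "item": j} for i in range(1, 7) for j in range(2)]

order_nodes = distribute(orders, "oid", 3)
detail_nodes = distribute(details, "oid", 3)

# Each node joins its own slices; no row has to travel between nodes.
local_results = []
for node_orders, node_details in zip(order_nodes, detail_nodes):
    by_oid = {o["oid"]: o for o in node_orders}
    local_results.append([{**by_oid[d["oid"]], **d} for d in node_details])
```

Because both tables use the same placement rule on the same key, every detail line finds its order on the node where it already lives.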

Review and Summary

Looking back at the two categories of JOIN and the scenarios above, the high-performance algorithms that SPL provides for each situation exploit the characteristics of the different JOIN types to make JOIN run faster. SQL treats all of these scenarios in a single general way, so it cannot implement such algorithms based on the characteristics of each JOIN type. For example, when both the fact table and the dimension table are loaded into memory, SQL can only calculate hash values and compare key values, and cannot use addresses for direct correspondence. SQL tables are unordered, so when large tables are associated by primary key they cannot be merged in order; only hash partitioning can be used, which may require multiple rounds of caching and makes performance unpredictable to some extent.

In terms of parallel computing, segmented parallelism is easy in SQL for single-table calculations. For multi-table association, however, usually only fixed segmentation can be prepared in advance; synchronous dynamic segmentation is difficult to achieve, so the degree of parallelism is hard to decide at run time.

The same is true for cluster operations. In principle, SQL does not distinguish dimension tables from fact tables, so a JOIN on large tables inevitably causes a hash shuffle that takes up a lot of network resources. When there are too many cluster nodes, the network transmission delay outweighs the benefit of having more nodes.

SPL designs and applies a new operation and storage model, which can solve these problems of SQL in principle and implementation. For different classifications and scenarios of JOIN, programmers can adopt the above-mentioned high-performance algorithms in a targeted manner to obtain faster computing speed and make JOIN run faster.




Origin blog.csdn.net/u010634066/article/details/124266950