What if SQL (and stored procedures) run too slowly?

As the most commonly used data processing language, SQL is widely used for queries and batch jobs. When the amount of data is large, SQL (and stored procedures) often runs very slowly and needs to be optimized. There are well-known routines for this. Usually you check the execution plan to locate the cause of the slowness and then rewrite the statement. For example, for a test against consecutive values you can use BETWEEN instead of IN, name the fields in the SELECT list instead of selecting everything, use UNION ALL instead of UNION when duplicates do not need to be removed, and rewrite EXISTS as a JOIN. There are also engineering methods, such as creating indexes and using temporary tables or summary tables. There are many such methods, and DBAs will be familiar with them.
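Two of the rewrites mentioned above can be tried on a small scale. The following is a minimal sketch that uses Python's built-in `sqlite3` module; the `orders` table, its columns, and the function name are made up for illustration, and a toy table like this says nothing about real execution plans. It only shows that BETWEEN returns the same rows as IN for consecutive values, and that UNION ALL skips the duplicate removal that UNION performs.

```python
import sqlite3

def demo_rewrites():
    """Run each original/rewritten pair on a tiny in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(i, i * 10) for i in range(1, 11)])

    # Consecutive values: an IN list versus a BETWEEN range test.
    with_in = conn.execute(
        "SELECT id FROM orders WHERE id IN (3, 4, 5, 6) ORDER BY id").fetchall()
    with_between = conn.execute(
        "SELECT id FROM orders WHERE id BETWEEN 3 AND 6 ORDER BY id").fetchall()

    # UNION removes duplicates (extra work); UNION ALL only concatenates.
    union = conn.execute(
        "SELECT id FROM orders WHERE id <= 2 "
        "UNION SELECT id FROM orders WHERE id <= 3 ORDER BY id").fetchall()
    union_all = conn.execute(
        "SELECT id FROM orders WHERE id <= 2 "
        "UNION ALL SELECT id FROM orders WHERE id <= 3 ORDER BY id").fetchall()
    conn.close()
    return with_in, with_between, union, union_all
```

Because UNION ALL keeps duplicates, it is only a valid replacement when the two inputs cannot overlap or when duplicates are acceptable.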

Unfortunately, there are still quite a few cases where SQL cannot be made to run faster no matter how it is optimized. [Doing SQL performance optimization is really eye-opening](http://c.raqsoft.com.cn/article/1638515485216) introduces some of them along with the corresponding technical analysis. Because of the limitations of its theoretical basis, relational algebra, SQL lacks support for discreteness and ordered sets. This makes some high-performance algorithms extremely difficult to express in SQL, or even impossible to write at all, and one can only watch hardware resources go to waste. [Writing Simple and Fast Database Language SPL](http://c.raqsoft.com.cn/article/1641249707028) gives an accessible explanation of these flaws in the theoretical foundation of SQL.

In other words, the slowness of SQL is rooted in theory. Engineering-level optimization of the database can only partially improve this kind of problem, not solve it fundamentally. Many commercial databases can indeed recognize certain SQL statements and convert them into high-performance algorithms, but when the situation gets complicated the optimizer gets "dizzy" and can only execute a low-performance algorithm that follows the literal logic of the SQL. Nor can a theoretical defect be solved by switching databases. As long as SQL is still used, the same holds even for distributed and in-memory databases. They can bring some performance improvement by consuming more expensive resources, but there is still a huge gap between the performance achieved and what the hardware should be capable of.

What else can we do?

Then stop using SQL, and stop using relational databases.

Then what should be used instead?

If SQL cannot describe these high-performance algorithms, can Java or C++?

No problem! In theory, any algorithm can be implemented in Java or C++, and because these languages can control the low-level actions of the computer, such code can usually run very fast (as long as the programmer is reasonably capable).

However, don't celebrate too early. Although these algorithms can be written in such languages, the languages are too low-level and do not provide any high-performance computing library for data processing, so everything has to be written from scratch. Take hash join as an example: a Java implementation needs at least several hundred lines of code, since you must not only design a suitable hash function but also handle possible hash collisions, and this engineering workload is not small. Multi-threaded programming is also not easy, yet parallel computing is an effective means of improving computing performance. Likewise, structured data computing involves many other algorithms, and the complexity of implementing all of them yourself is easy to imagine. If a calculation is so complex to implement that its development cost far exceeds the benefit of the performance gain, then the optimization is pointless.
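To make the idea of hash association concrete, here is a minimal in-memory sketch in Python. It is deliberately not the from-scratch implementation the text describes: Python's built-in dictionary already supplies the hash function and the collision handling, so only the build and probe phases remain visible, and nothing here deals with data larger than memory or with parallelism. The function name and the row layout (lists of dicts) are assumptions made for illustration.

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts using a build phase and a probe phase."""
    # Build: index the (ideally smaller) right side by its join key.
    buckets = defaultdict(list)
    for row in right:
        buckets[row[right_key]].append(row)
    # Probe: look up each left row's key and emit one merged row per match.
    joined = []
    for row in left:
        for match in buckets.get(row[left_key], []):
            joined.append({**match, **row})  # left fields win on name clashes
    return joined
```

For example, joining a list of orders to a list of customers on `cust_id` returns only the orders whose customer exists, each extended with the customer's fields.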

Python faces similar problems. Although its class libraries for structured data computing are much richer than Java's, it does not provide the necessary high-performance algorithm library or storage scheme. For example, it provides neither a cursor type with related operations for big data nor an efficient parallel mechanism. If you want those high-performance algorithms you have to develop them yourself, but as an interpreted language Python itself is not very efficient, and algorithms developed on top of it often fail to meet high-performance requirements. Similarly, Scala lacks sufficient high-performance computing class libraries, and writing the algorithms yourself is also quite complex. For programmers who are not familiar with these algorithms, code implemented from scratch often runs slower than SQL on an optimized commercial database.

So do we just have to put up with slow SQL?

No, you can use SPL!

SPL and high performance

The open-source SPL (Structured Process Language) is a programming language for structured data processing. With SPL, calculations that were slow in SQL can be made to run faster.

Why can SPL run fast? Is there some magic that changes hardware performance?

No. Software cannot change the computing performance of hardware, and SPL is no exception. Put simply, SPL is fast for the reason mentioned above: it uses higher-performance algorithms. SPL provides a large library of basic high-performance algorithms, and code built on this library can effectively reduce the amount of computation. A calculation is composed by combining these algorithms. When each step is faster, the whole is much faster, and that is how computing performance is improved.

The high-performance algorithms designed for SPL, such as traversal multiplexing, ordered merge, foreign key pre-association, tag bit dimension, and parallel computing, are already encapsulated in the library. Many of these algorithms are unique to SPL and appear here for the first time in the industry.

With these encapsulated algorithm libraries, writing programs becomes very convenient. The algorithms can be used directly instead of being developed from scratch, so the result is not only high performance but also fast development. From this point of view, running fast and being simple to write are actually the same thing: being able to write high-performance algorithms efficiently. In contrast, Java, C++, Python, and Scala lack these algorithm libraries, which makes high performance difficult to achieve.

Here are some examples of high-performance algorithms in SPL, with use cases compared against SQL:

[Performance Optimization Skills: Traversal Multiplexing]( http://c.raqsoft.com.cn/article/1568960169923 )  

[Performance Optimization Tips: TopN]( http://c.raqsoft.com.cn/article/1568974653153 )  

[Performance Optimization Tips: Pre-Association]( http://c.raqsoft.com.cn/article/1574142747764 )  

[Performance Optimization Skills: Foreign Key Serialization]( http://c.raqsoft.com.cn/article/1575263621672 )  

[Performance Optimization Tips: Schedule]( http://c.raqsoft.com.cn/article/1582862688002 )  

[Performance Optimization Tips: Unilateral Heaping]( http://c.raqsoft.com.cn/article/1583667643978 )  

[Performance Optimization Skills: Ordered Grouping]( http://c.raqsoft.com.cn/article/1585814654188 )

SPL looks at the same computing task from a different perspective than SQL, which makes it possible to use different computing methods with lower complexity.
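TopN, one of the topics linked above, is a convenient way to see what "lower complexity" means. The sketch below is generic Python, not SPL code and not SPL's implementation; both function names are made up. Treating TopN as "sort everything, then take the first n" costs about O(m log m) for m values, while treating it as an aggregation that keeps only n candidates during a single scan costs about O(m log n) and never needs the whole data set in sorted form.

```python
import heapq

def top_n_by_sort(values, n):
    """Sort the whole input, then keep the n largest values."""
    return sorted(values, reverse=True)[:n]

def top_n_by_heap(values, n):
    """Scan once, keeping only the n largest values seen so far in a min-heap."""
    if n <= 0:
        return []
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest kept candidate, so swap it in.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)
```

Both functions return the same answer. The second one also works on an iterator that streams values from a file, which is what matters when the data does not fit in memory.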

In practice, SPL has been used in many performance optimization cases, with speedups ranging from several times to dozens of times, and in extreme cases thousands of times. A speedup of an order of magnitude is basically the norm.

For example, in the case of optimizing auto insurance policy approval at an insurance company ([Open source SPL optimizes the approval of an insurance company from 2 hours to 17 minutes](http://c.raqsoft.com.cn/article/1594119021002)), using SPL reduced the computation time from 2 hours to 17 minutes while cutting the amount of code by two thirds. The case uses SPL's distinctive traversal multiplexing technique, which performs multiple operations during a single traversal of big data and effectively reduces external storage access. It involves three association-and-aggregation operations on one large table. With SQL the large table has to be traversed three times, while with SPL it is traversed only once, and different methods are also used for the association operations, so a huge performance improvement is obtained.
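The idea behind traversal multiplexing can be illustrated outside SPL with a minimal Python sketch. This is not the code from the insurance case: the row layout `(region, product, amount)` and both function names are hypothetical, and an in-memory list stands in for a large table on disk, where every extra pass would mean reading the whole table again.

```python
from collections import defaultdict

def aggregate_in_three_passes(rows):
    """Three separate aggregations, each scanning the data again."""
    by_region = defaultdict(int)
    for region, _, amount in rows:        # pass 1
        by_region[region] += amount
    by_product = defaultdict(int)
    for _, product, amount in rows:       # pass 2
        by_product[product] += amount
    largest = max((amount for _, _, amount in rows), default=None)  # pass 3
    return dict(by_region), dict(by_product), largest

def aggregate_in_one_pass(rows):
    """The same three results, computed during a single scan."""
    by_region, by_product = defaultdict(int), defaultdict(int)
    largest = None
    for region, product, amount in rows:
        by_region[region] += amount
        by_product[product] += amount
        if largest is None or amount > largest:
            largest = amount
    return dict(by_region), dict(by_product), largest
```

The amount of arithmetic is the same in both versions. What the one-pass version saves is the repeated reading of the data, which dominates when the table lives on disk.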

In the case of [Open source SPL turns the pre-association of bank mobile phone account inquiry into real-time association](http://c.raqsoft.com.cn/article/1595490353934), SPL turned a mobile banking account query that previously had to rely on pre-association into real-time association, and the number of servers was reduced from 6 to 1. This case makes full use of SPL's ordered storage mechanism. Because the data of one account is physically stored contiguously, reading all of it at once takes much less hard disk time. The real-time foreign key association technique, which distinguishes dimension tables from fact tables, then completes the association at query time. Performance improves significantly and hardware requirements drop greatly.
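Why ordered storage helps a per-account query can be sketched with an in-memory analogy. The code below is an illustration in Python, not SPL's storage format: `sorted_keys` is a hypothetical list of account ids in storage order, and `rows` is the parallel list of records. When records are sorted by account, all records of one account form a single contiguous run that binary search can locate, instead of being scattered across the whole data set.

```python
from bisect import bisect_left, bisect_right

def rows_for_account(sorted_keys, rows, account_id):
    """Return the contiguous run of rows whose key equals account_id.

    Assumes sorted_keys is in ascending order and rows[i] belongs to sorted_keys[i].
    """
    start = bisect_left(sorted_keys, account_id)   # first position of the account
    end = bisect_right(sorted_keys, account_id)    # one past its last position
    return rows[start:end]
```

On disk the same property means that one account's data can be fetched with a single sequential read rather than many random ones.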

Here are some common business scenarios where SPL's algorithm library can be used to achieve high performance:

[How to make JOIN run faster? ]( http://c.raqsoft.com.cn/article/1650025817431 )

[How does an in-memory database take advantage of memory? ]( http://c.raqsoft.com.cn/article/1651215810463 )

[How to store data warehouse more efficiently]( http://c.raqsoft.com.cn/article/1652169375363 )

[How to do high concurrent account query? ]( http://c.raqsoft.com.cn/article/1655889820835 )

[What is the key to fast multi-label user portrait analysis? ]( http://c.raqsoft.com.cn/article/1656553554035 )

[Double-dimensional ordered structure speeds up user behavior analysis of large data volumes]( http://c.raqsoft.com.cn/article/1654051775236 )

The reasons for SPL's high performance are also analyzed in detail in [How to achieve the performance of an order of magnitude faster](http://c.raqsoft.com.cn/article/1621898754065), which also offers more practical optimization cases for reference.

Further discussion

At this point you may wonder: will calculations run faster simply by learning the SPL syntax?

It's not that simple!

About Algorithms

The higher performance obtained with SPL does not come from SPL syntax. The syntax has its own features, but it is not the fundamental reason for running fast. What matters most is mastering and applying high-performance algorithms.

Performance optimization takes two steps: first, design a computing scheme with low complexity; second, implement it at a low enough cost. The first step is the more critical one and has to be done by programmers with sufficient experience and knowledge (that is, programmers who have mastered high-performance algorithms). The second step is where SPL comes in. In other words, SPL is not responsible for designing the solution to a problem. It is only responsible for making the solution easier to implement.

SPL syntax is very simple, much easier than Java. You can grasp the basics in about two hours and become proficient in two or three weeks. The algorithms are not that simple. They require careful study and repeated practice. Conversely, once the algorithms are mastered, which syntax to use is a relatively minor issue (although a language as clumsy as SQL will still not do). It is like treating a patient: once you have found the cause of the illness and worked out which ingredients will be effective, you can cure the disease whether you buy ready-made medicine (use the encapsulated SPL) or climb the mountain to gather herbs yourself (hand-code in Java/C++). Only the amount of trouble and the cost differ.

Because they are rarely used in practice, many application programmers forget the data structure and algorithm courses they took in college after working for a few years, and without this basic knowledge they cannot design high-performance programs. To this end, SPL offers a dedicated high-performance topic that covers high-performance algorithms and optimization techniques, together with a performance optimization course and book, with the aim of teaching people how to fish.

[High-performance computing topic]( http://c.raqsoft.com.cn/article/1647044897121 )

[Performance Optimization Book]( http://c.raqsoft.com.cn/article/1613911172557 )

[Performance Optimization Course]( http://www.raqsoft.com.cn/wx/course-performance-optimizing.html )

About storage

Closely related to algorithms, the other key to high-performance computing is data storage. High-performance computing cannot do without a reasonable way of storing data. When SPL is used for high-performance computing, in most cases the work can no longer be done on top of the database, and the data needs to be moved out of the database and reorganized.

Why?

Slow computing tasks fall into two categories: computation-intensive and data-intensive. A purely computation-intensive task involves a small amount of data but a large amount of computation, and the computation is not large because the data is large. In this case there is no need to change the storage. As long as a good computing method is implemented, performance can be greatly improved. In other words, SPL can be used on top of the original storage (such as a database) to optimize performance. A data-intensive task also involves a large amount of computation, but mainly because the amount of data is large. If the storage is not changed, reading the data may take a very long time, and even if the computing time were optimized to zero, the overall running time would not improve much.
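The last point is simple arithmetic, and a tiny Python helper makes it explicit. The function name and the timings in the example are hypothetical, not measurements: if a task spends its time on reading plus computing, and only the computing part is optimized, the overall speedup can never exceed the total time divided by the read time.

```python
def max_overall_speedup(read_seconds, compute_seconds):
    """Upper bound on end-to-end speedup when only compute time can shrink."""
    if read_seconds <= 0:
        raise ValueError("read_seconds must be positive")
    before = read_seconds + compute_seconds
    after = read_seconds  # compute time optimized all the way to zero
    return before / after
```

With 90 seconds of reading and 10 seconds of computing, the bound is about 1.11, so the task can barely get faster no matter how good the algorithm is. With 10 seconds of reading and 90 seconds of computing, the bound is 10. This is why data-intensive tasks need faster storage as well as better algorithms.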

Unfortunately, most of the slow computing scenarios we face are data-intensive. If the data stays in the database, it takes a long time to read, because database access interfaces (such as JDBC) are usually very slow (IO efficiency is very low). The read time often far exceeds the subsequent SPL computing time, so no optimization effect can be achieved. Moreover, quite a few algorithms in SPL place requirements on how storage is organized. For example, the unilateral heaping algorithm requires ordered storage. Conventional relational databases cannot satisfy this precondition, so these algorithms cannot be implemented on them.
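How an algorithm can depend on storage order is easiest to see with ordered merge, one of the algorithms named earlier. The sketch below is a generic merge join in Python, not the unilateral heaping algorithm and not SPL's implementation. It assumes both inputs are lists of `(key, value)` pairs already sorted by key, and that keys are unique on the left side (for example, orders on the left and order details on the right). Under those assumptions the join needs only one forward scan of each input and no hash table.

```python
def merge_join(left, right):
    """Join two key-sorted lists of (key, value) pairs in one forward scan.

    Assumes keys on the left are unique and both lists are in ascending key order.
    """
    result = []
    i = 0
    for key, right_value in right:
        # Advance the left cursor until it reaches or passes the current key.
        while i < len(left) and left[i][0] < key:
            i += 1
        if i < len(left) and left[i][0] == key:
            result.append((key, left[i][1], right_value))
    return result
```

If the inputs were not sorted, this function would silently miss matches, which is the sense in which such algorithms cannot be used without storage that guarantees order.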

To solve this problem, SPL provides its own storage mechanism, which uses the file system directly: data is exported from the database into files of a specific format. This not only gives higher IO efficiency and the flexible management capabilities of the file system, but also makes full use of the storage advantages of SPL's own format, such as columnar storage, ordering, compression, and parallel segmentation, so that the high-performance algorithms can deliver their full effect.

Storing data in files can also effectively reduce the time spent writing data and further improve computing performance. Some computing scenarios not only read from the data source but also need to persist results in the database for subsequent calculations. ETL, for example, is a typical calculation that both reads and writes, and some big data or complex calculations need to store intermediate results temporarily for later steps. Database writes are known to be very slow, so calculations that involve writing naturally perform poorly. In such cases the data that would otherwise be written to the database can be stored in files instead (although this is only an engineering advantage, read and write performance can still improve by an order of magnitude) and then computed directly with SPL's file computing capability, thereby achieving high performance.

About T+0

If the data is moved out of the database, does that make real-time calculation impossible? After all, new data is always being generated.

No problem.

For T+0 real-time queries over the full data, SPL provides multi-source mixed computing. Cold data is large in volume and no longer changes, so it is kept in SPL's high-performance file storage for higher computing performance. The small amount of hot data stays in the original data source, and SPL reads it directly for calculation (diverse data sources are supported). Because the hot data is small, querying it directly from the production data source does not put much load on that source, and the access time is not too long. Mixing cold and hot data in one calculation yields T+0 real-time queries over the full data. All that is required is to periodically solidify cold data into SPL's high-performance storage, so that the original data source only has to hold the small amount of recently generated hot data. The overall structure is as follows:

[Figure: overall structure of mixed computing over cold data (SPL file storage) and hot data (original data source)]
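The mixed calculation of cold and hot data described above can be imitated in a few lines of Python. This is only a sketch under stated assumptions, not SPL code: a CSV file object stands in for SPL's file storage, an `sqlite3` connection stands in for the production database, and the `txn` table, the `account` and `amount` columns, and the function name are all hypothetical.

```python
import csv
import sqlite3
from collections import defaultdict

def total_by_account(cold_file, hot_conn):
    """Sum amounts per account over cold file data plus hot database data."""
    totals = defaultdict(int)
    # Cold data: large, unchanging history kept in file storage.
    for row in csv.DictReader(cold_file):
        totals[row["account"]] += int(row["amount"])
    # Hot data: the small, recent part still held by the original data source.
    for account, amount in hot_conn.execute("SELECT account, amount FROM txn"):
        totals[account] += amount
    return dict(totals)
```

The query result covers all data as of the moment of the query, while the production database only has to serve the small hot part. Periodically moving rows that have gone cold from the database into the file keeps the hot part small.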

How to get started

The analysis above shows that performance optimization requires familiarity with high-performance algorithms and storage mechanisms. The course and book mentioned above also show that there is a lot of material, and mastering it is not easy. In particular, many programmers are used to the SQL way of thinking and find it hard to break out of it. Faced with a performance optimization task, they often do not know where to start, even with a sharp tool like open-source SPL at hand. It is like an expert wagon driver who wants to go faster: he will habitually look for the reins and the whip, and will be puzzled by the steering wheel and accelerator the first time he sees a car.

After working through one or two cases, programmers become familiar with SPL's way of thinking (they understand the steering wheel and the accelerator), and doing performance optimization on their own will no longer be a problem.

In the martial arts world, only speed is unbeatable, but only by mastering the essence and the method of speed can you truly be unbeatable. Don't you agree?

SPL Information

- [ SPL Download ]

- [ SPL source code ]

 

 

 


Origin blog.csdn.net/qq_41640218/article/details/126619243