Why is running a batch so difficult?


The detailed data generated by business systems usually has to be processed, according to certain logic, into the results needed to support an enterprise's business activities. There are usually many such data processing tasks, and they have to be completed in batches. In the banking and insurance industries this is often called running batches, and other industries such as petroleum and electric power have the same need.

Most business statistics take a certain day as the cut-off point, and to avoid affecting the production system, batch tasks are generally run at night. Only then can the day's new detailed data be exported from the production system and loaded into a dedicated database or data warehouse where the batch computation is done. The next morning, the batch results are ready for business staff.

Unlike online queries, a batch job is an offline task executed automatically on a schedule. No one accesses it concurrently, so there are no concurrency issues and no need to return results in real time. However, the batch must finish within a specified window. For example, a bank's batch window may run from 8:00 p.m. to 7:00 a.m. the next day; if the batch is not done by 7:00 a.m., business staff cannot work normally, with serious consequences.

A batch task usually involves a huge amount of data, quite possibly all of the historical data, and the computing logic is complex with many steps, so batch time is measured in hours; a single task running two or three hours is nothing unusual. As the business grows, data volume keeps increasing, the load on the batch database rises rapidly, and eventually the batch cannot finish even running all night, seriously affecting business. That is unacceptable.

Problem analysis

To solve the problem of batches taking too long, we must carefully analyze the problems in the existing system architecture.

The typical architecture of a batch running system looks roughly like this:

As the figure shows, data is extracted from the production database and loaded into the batch database. The batch database is usually relational, and stored procedures are written to perform the batch computation. The batch results are generally not used directly; they are exported from the batch database and provided to other systems as interface files, or imported into those systems' own databases. This is the typical architecture; the production database in the figure may also be a central data warehouse or a Hadoop cluster. In general, the production database and the batch database are not the same database, and data is transferred between them as files, which also helps reduce coupling. After the batch computation finishes, the results are needed by multiple application systems and are generally transmitted as files as well.

The first reason batches run so slowly is that the relational database used for the batch task moves data in and out too slowly. Because the storage and computing of a relational database are closed, data entering or leaving it must pass through many constraint checks and security routines. With large data volumes, both writing and reading become very inefficient and time-consuming, so importing the file data into the batch database, and exporting the batch results back to files afterwards, are both very slow.

The second reason is the poor performance of stored procedures. SQL's syntax system is too old and has many limitations; many efficient algorithms simply cannot be expressed, so the SQL statements inside a stored procedure perform poorly. Moreover, complex business logic is hard to implement in a single SQL statement. It usually has to be split into steps and completed with a dozen or even dozens of SQL statements, and each statement's intermediate result must be stored in a temporary table for the following statements to use. When a temporary table holds a lot of data it has to be written to disk, causing a large amount of data to be written out; and since a database's write performance is much worse than its read performance, this severely slows down the whole procedure.

For even more complex calculations, SQL statements alone may not suffice, and database cursors must be used to fetch data and compute in a loop. But traversal with a database cursor performs far worse than SQL statements, and cursors generally have no direct support for multi-threaded parallelism, making it hard to exploit multiple CPU cores, so performance gets even worse.

So, could we use a distributed database instead of a traditional relational one and speed up the batch by adding nodes?

The answer is still no. The main reason is that batch computing logic is quite complicated: even with a traditional database's stored procedures it often takes thousands or even tens of thousands of lines of code, and the stored procedures of distributed databases are still comparatively weak at computation, so such complex batch calculations are hard to implement on them.

Moreover, when a complex computing task has to be split into multiple steps, a distributed database also faces the problem of landing intermediate results. Since the data may sit on different nodes, an intermediate result landed in one step and read in the next causes a large number of reads and writes across the network, and performance becomes very hard to control.

Nor can the distributed-database technique of improving query speed through data redundancy be used here. Redundant copies can be prepared in advance of queries, but the intermediate results of a batch are generated on the fly; making them redundant would mean producing multiple copies on the spot, which would only make the whole job slower.

In practice, therefore, batch jobs usually run on a large single-node database, and when the computing load is too heavy, an appliance such as ExaData is used (ExaData runs multiple database instances, but it has been specially optimized by Oracle and can be regarded as one super-large monolithic database). Although it is very slow, there is no better choice for now: only this class of large database has enough computing power, so it has to be used for batch tasks.

Using SPL to run batches

SPL, an open-source professional computing engine, provides computing power that does not depend on a database: it computes directly on the file system, which removes the relational database's slow data loading and unloading. SPL also implements better algorithms whose performance far exceeds that of stored procedures, significantly improving single-machine computing efficiency, so it is well suited to batch computing.

The new batch architecture implemented with SPL looks like this:

In the new architecture, SPL removes the two bottlenecks that made batches slow.

First, data movement in and out of the database. SPL can compute directly on the files exported from the production database, without importing them into a relational database first. After the batch computation, SPL can likewise write the final result directly to a common format such as a text file and pass it to other application systems, avoiding the export step of the original batch database. SPL thus skips the slow trip into and out of the relational database entirely.
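As an illustration, here is a minimal SPL sketch of that pattern. The file names and fields (detail_20231231.txt with status, branch and amount) are hypothetical, and the text after // is explanatory annotation rather than part of the script:

    A1=file("detail_20231231.txt").cursor@t()    // cursor over the exported text file; @t reads field names from the first row
    A2=A1.select(status=="A").groups(branch;sum(amount):total)    // filter and aggregate while streaming, with no batch database involved
    A3=file("batch_result.txt").export@t(A2)    // write the result as a text file for downstream systems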

Now look at the computing process. SPL provides better algorithms, many of them industry firsts, whose computing performance far exceeds that of stored procedures and SQL statements.

These high-performance algorithms apply to the JOINs, traversals, grouping and aggregation that are common in batch tasks, and they effectively raise computing speed. For example, a batch job often traverses an entire history table, and sometimes the same history table must be traversed several times to complete different pieces of business logic. History tables are generally large, and each traversal takes a long time. Here SPL's traversal-reuse mechanism applies: a single pass over the large table completes multiple calculations at once, saving a great deal of time.
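A sketch of what traversal reuse can look like, following SPL's documented channel pattern; the file history.btx and the fields cust, area and amount are hypothetical, and the // text is annotation:

    A1=file("history.btx").cursor@b()    // one cursor over the big history table, stored in SPL binary format
    A2=channel(A1).groups(area;sum(amount):total)    // attach a second computation to the same traversal via a channel
    A3=A1.groups(cust;count(1):cnt)    // fetching the cursor drives both computations in a single pass
    A4=A2.result()    // afterwards, collect the channel's result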

SPL's multipath cursors read and compute data in parallel, so even complex batch logic can run multi-threaded and use multiple CPU cores, something database cursors can hardly do. This alone often lets SPL run several times faster than a stored procedure.
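A minimal sketch assuming SPL's @m multipath cursor option (file and fields remain the hypothetical ones above):

    A1=file("detail.btx").cursor@mb()    // @m makes a multipath cursor: the file is split into segments read by parallel threads
    A2=A1.groups(branch;sum(amount):total)    // the grouping runs in parallel on each path and the partial results are merged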

SPL's delayed cursor mechanism can define multiple calculation steps on one cursor and then let the data flow through those steps in order, implementing chained calculation and greatly reducing how often intermediate results must be landed. When data does have to be landed, SPL can store intermediate results in its own high-performance data format for the next step. SPL's high-performance storage is file based and applies techniques such as ordered compressed storage, free columnar storage, double increment segmentation, and its own compression encodings to reduce disk footprint, with read and write speeds far better than a database's.
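A sketch of chained delayed computation with one landed intermediate result; the 0.06 rate and the field names are made up for illustration:

    A1=file("detail.btx").cursor@b()    // streaming cursor over the detail data
    A2=A1.select(amount>0).derive(amount*0.06:tax)    // delayed steps: defined here, executed only as data flows through
    A3=file("mid.btx").export@b(A2)    // when landing is unavoidable, stream the intermediate result into SPL's binary format
    A4=file("mid.btx").cursor@b()    // the next step reads it back as a cursor rather than a database temp table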

Application results

Architecturally, SPL breaks through the two bottlenecks of relational batch databases, and it has delivered very good results in real applications.

Bank L's batch task used the traditional architecture: a relational database served as the batch database, and the batch logic was implemented with stored procedures. One of them, the loan agreement stored procedure, took 2 hours to execute and was a predecessor task of many other batch tasks, so its long run time seriously affected the entire batch.

After switching to SPL and applying high-performance algorithms and storage mechanisms such as high-performance columnar storage, file cursors, multi-threaded parallelism, small-result in-memory grouping, and cursor reuse, the original 2-hour computation was shortened to 10 minutes, a 12-fold performance improvement.

The SPL code is also more concise. The original stored procedure ran to more than 3,300 lines; rewritten in SPL it is only about 500 statements, a more than 6-fold reduction in code that greatly improves development efficiency.

In the auto insurance business of insurance company P, new policies must be associated with the historical policies of previous years, a batch task known as historical policy association. It, too, was originally done with a relational database: the stored procedure took 47 minutes to associate 10 days of new policies with the historical policies, and 112 minutes, nearly 2 hours, for 30 days; with a larger date span the run time became unbearably long, and the job effectively turned into an impossible task.

After switching to SPL and applying techniques such as high-performance file storage, file cursors, ordered merge with segmented retrieval, in-memory association, and traversal reuse, computing 10 days of new policies takes only 13 minutes, and 30 days only 17 minutes, nearly 7 times faster. Moreover, the new algorithm's execution time grows only slightly as the policy date span increases, instead of growing proportionally as the stored procedure's did.

In terms of code volume, the original stored procedure had 2,000 lines of code, more than 1,800 after stripping comments, while the SPL script totals less than 500 cells, under a third of the original.

Bank T runs a daily batch task over the detailed data of loans issued through its Internet channel, aggregating all historical data up to the specified date. The task was implemented with relational-database SQL statements and took 7.8 hours in total, occupying far too much of the batch window and even affecting other batch tasks, so it had to be optimized.

After switching to SPL and applying techniques such as high-performance files, file cursors, ordered grouping, ordered association, delayed cursors, and binary search, the batch task that originally took 7.8 hours now finishes in 180 seconds on a single thread, and in 137 seconds with two threads, a 204-fold speed-up.
