How to make a column-stored data warehouse more efficient


Many data warehouse products use columnar storage. When a table has many columns but a calculation involves only a few of them, column storage lets the engine read just the required columns, reducing hard disk access and improving performance. The advantage is most obvious when data volumes are very large and hard disk scanning accounts for a large share of total time.

So does adopting column storage by itself guarantee the best performance? Let's look at where columnar storage can be made more efficient.

1. Compression

Structured data is generally not encoded very compactly, so there is usually room for compression. Data warehouses typically compress data on top of column storage, physically reducing the amount stored and thereby cutting read time and improving performance. Values of the same field share a data type and are often numerically close, and such batches of data usually compress well. Because column storage keeps the values of one field together, it lends itself to compression better than row storage does.

However, a general-purpose compression algorithm cannot assume the data has any particular characteristics; it can only treat the input as an arbitrary byte stream, so it does not always achieve the best compression ratio. Moreover, algorithms with high compression ratios usually cost more CPU time to decompress. That extra time can exceed the hard disk read time the compression saves, making it a net loss.

If we preprocess the data first, deliberately creating characteristics the compression algorithm can exploit, we can reach a higher compression ratio while keeping CPU consumption low.

Sorting the data before storing it is an effective form of such preprocessing. Tables often contain many dimension fields, such as region and date, whose values come from a small set, so a large table contains many repeats. If the data is sorted by these columns, adjacent records frequently share the same values, and even a very lightweight compression algorithm achieves a good compression ratio. In the simplest scheme, each column value is stored once together with its repeat count instead of being stored repeatedly, and the space savings are considerable.
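
As a rough illustration of the idea (a minimal sketch in plain Python, not SPL's actual storage format), run-length encoding a sorted column could look like this:

def rle_encode(column):
    """Store each value once with its repeat count instead of repeating it.
    A minimal sketch of the lightweight 'value + repetitions' scheme above."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

# After sorting, repeats become adjacent, so a few runs cover many rows:
print(rle_encode(["Beijing"] * 3 + ["Shanghai"] * 2))
# -> [['Beijing', 3], ['Shanghai', 2]]  (2 entries for 5 rows)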

The sort order also matters: try to sort by the columns with longer field values first. For example, given a region column and a gender column, region values ("Beijing", "Shanghai", etc.) contain more characters than gender values ("male", "female"), so sorting by region first and gender second generally compresses better than the reverse order.
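
The effect can be sanity-checked with toy numbers. The sketch below assumes a naive per-column run-length encoding in which each run costs the value's length plus a 4-byte repeat count; the field values and sizes are made up for illustration:

from itertools import product

def rle_bytes(rows):
    """Total bytes of a naive per-column RLE: each run stores the value
    once plus an assumed 4-byte repeat count."""
    total = 0
    for col in zip(*rows):
        prev = object()
        for v in col:
            if v != prev:
                total += len(v) + 4
                prev = v
    return total

# 1000 copies of every (region, gender) pair; names chosen for illustration.
rows = [p for p in product(["Heilongjiang", "Guangdong", "Chongqing"],
                           ["female", "male"]) for _ in range(1000)]
print(rle_bytes(sorted(rows, key=lambda t: (t[0], t[1]))))  # region first: 96
print(rle_bytes(sorted(rows, key=lambda t: (t[1], t[0]))))  # gender first: 102

With these particular value lengths, the region-first order stores fewer bytes; with much shorter region names the two orders come out nearly even, so the rule is a heuristic that depends on value lengths and cardinalities rather than a guarantee.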

We can also optimize data types, for example converting strings, dates, and the like to compact numeric encodings. Once the region and gender fields are converted to small integers, their field values have the same length; then it pays to put the fields with more repetition first. Gender has only two enumeration values while region has many more, so gender values repeat more often within the records, and sorting by gender first and region second usually takes less space.
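
A minimal sketch of such a conversion (dictionary encoding in plain Python; SPL's actual numeric encodings are not shown here):

def dict_encode(column):
    """Replace strings with small integer codes plus a one-off lookup table."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in column]
    return encoded, list(codes)          # per-row codes + dictionary

region_codes, region_dict = dict_encode(["Beijing", "Shanghai", "Beijing"])
print(region_codes, region_dict)         # [0, 1, 0] ['Beijing', 'Shanghai']
# Once every field is a same-width integer, put the lowest-cardinality
# (most repetitive) field first in the sort order, e.g. gender before region.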

The column storage solution provided by the open-source data computing engine SPL implements this compression scheme. When ordered data is appended to an SPL group table, the method above is applied automatically by default: each value is recorded only once together with its repeat count.

The SPL code that creates an ordered group table and then traverses it looks roughly like this:

Example code 1: ordered compressed column storage and traversal calculation

A
1 =file("T_ordinary.ctx").open().cursor(f1,f2,f3,f4,…).sortx(f1,f2,f3)
2 >file("T.ctx").create(#f1,#f2,#f3,f4,…).append@i(A1)
3 =file("T.ctx").open().cursor().groups(…;sum(amt1),avg(amt2),max(amt3+amt4),…)

A1: Creates a cursor on the original data and sorts it by the three fields f1, f2, and f3.

A2: Creates a new group table whose fields f1, f2, and f3 are declared ordered (the # prefix), and writes the sorted data into it.

A3: Opens the newly built group table and performs a grouped aggregation over it.

In one test, after applying data type optimization and ordered compressed column storage, SPL reduced the stored data volume by 31% while improving computing performance by more than 9 times.

For more detailed information about this test, please refer to: Multidimensional Analysis Background Practice 3: Dimensional Sorting and Compression

2. Parallelism

Multi-threaded parallelism makes full use of multiple CPUs and is an important way to speed up computation. Parallelism requires segmenting the data first. Segmenting row storage is relatively simple: divide the file roughly evenly by data volume, then scan for the record end marker to settle each segmentation point. Column storage cannot take the same approach. Because the columns are stored separately, they must also be segmented separately; and because of variable-length fields and compressed data, the same segmentation offset in different columns does not necessarily fall on the same record, which would cause read errors.

The industry generally adopts blocking to keep column segmentation synchronized: data within a block is stored by column, segmentation is done only in whole blocks, and no block is split further for parallelism. Implementing this requires deciding the block size up front. If the table's total size were fixed, with no data ever appended, a suitable block size would be easy to calculate. But tables generally keep growing, which creates a dilemma. If blocks are large, there are too few of them while the table is still small, and segmentation cannot be flexible; yet uniform, flexible segmentation is the key to parallel performance. If blocks are small, a grown table ends up with a huge number of them, each column physically scattered across many discrete small blocks, with some useless data read in between blocks; given hard disk seek time, the more blocks there are, the worse the problem becomes. Many data warehouses and big data platforms cannot resolve this contradiction between block size and block count, so they struggle to exploit parallel computing fully.
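
The tension is easy to see with back-of-envelope numbers (all figures below are assumptions for illustration, not measurements of any product):

from math import ceil

def blocks(total_rows, rows_per_block):
    return ceil(total_rows / rows_per_block)

# Large blocks: a young table of 2 million rows yields only 2 blocks,
# so at most 2-way parallelism no matter how many CPUs are available.
print(blocks(2_000_000, 1_000_000))          # -> 2

# Small blocks: after growing to 1 billion rows there are 100,000 blocks
# per column; at an assumed 5 ms per seek that is ~500 s of seek time alone.
print(blocks(1_000_000_000, 10_000) * 5 / 1000, "seconds of seeks")  # -> 500.0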

SPL provides multiplication segmentation, which replaces fixed (physical) segmentation with dynamic (logical) segmentation and resolves this contradiction well. The method: give each column a fixed-size index area (for example, 1024 index slots), where each slot stores the starting position of one record, so that initially one record is one block. When appends fill all the slots, rewrite the index area: discard the even-numbered slots, move the odd-numbered ones forward, and free the second half of the area. This halves the block count to 512, with two records per block. The cycle of appending, filling up, and rewriting the index area then repeats, and as the data grows, the block size (records per block) keeps doubling. The index areas of all columns are filled and rewritten synchronously, so they always stay consistent. In essence the method segments by record count rather than byte count, so every column stays aligned even though each is segmented separately, and no misalignment can occur.
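
The following toy model (plain Python, not SPL's on-disk format) simulates the doubling index area for one column; the segments method at the end shows how any thread count can be served with nearly equal record counts:

class MultiplicativeIndex:
    """Toy model of the doubling index area described above, for one column."""
    CAPACITY = 1024                       # fixed number of index slots

    def __init__(self):
        self.starts = []                  # start position of each dynamic block
        self.per_block = 1                # records per block; doubles when full
        self.filled = 0                   # records appended into current block

    def append(self, record_start):
        if self.filled == 0:              # current block complete, start a new one
            if len(self.starts) == self.CAPACITY:
                # Index area full: keep every other slot, double the block
                # size, and halve the block count from 1024 to 512.
                self.starts = self.starts[::2]
                self.per_block *= 2
            self.starts.append(record_start)
        self.filled = (self.filled + 1) % self.per_block

    def segments(self, n):
        """Cut the current blocks into n nearly equal parts for n threads;
        the block count normally stays between 512 and 1024, so the cut
        is both flexible and uniform."""
        step = len(self.starts) / n
        return [self.starts[int(i * step)] for i in range(n)]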

With dynamic blocks, the block count stays between 512 and 1024 (except when there are fewer than 512 records), which satisfies the need for flexible segmentation; and the dynamic blocks of every column cover exactly the same record numbers, which satisfies the need for uniform segmentation. Good segmentation is obtained regardless of data volume. A detailed introduction to the principle can be found here: Multiplication segmentation of SPL.

The group table T generated in example code 1 adopts multiplication segmentation by default. To compute over T in parallel, only the A3 code needs to be modified:

=file("T.ctx").open().cursor@m().groups(;sum(amt1),avg(amt2),max(amt3+amt4),)

Adding the @m option to the cursor function is enough to enable parallel computation.

When appending data later, there is no need to regenerate the group table; just open it and append directly. The code looks roughly like this:

> file("T.ctx").open().append@i(cs)

Here, the data to be appended in cursor cs must continue to be ordered by the three fields f1, f2, and f3. In practical applications the incoming data does not always meet this condition; for that situation SPL also provides a high-performance solution. For details, see: Ordered storage of SPL.

3. Search

Column storage is better suited to traversal-style calculations such as grouping and aggregation; for most search tasks it delivers worse performance. Without an index, a typical column store cannot use binary search even when the data is stored in order. The reason is the same as in the parallel segmentation discussion above: column storage cannot guarantee that the columns stay synchronized, so misalignment could occur and cause read errors. Such column data can only be searched by traversal, and performance is very poor.

Indexes can be built on a column-stored table to avoid traversal, but they are very cumbersome. In principle the index must record the physical position of every field of a record, so its capacity is much larger than a row-storage index and may even approach the size of the original table (one position per field value means as many entries as the original data, just of a simpler type). Moreover, a lookup must read each field's data area separately, and because the hard disk has a minimum read unit, the total amount read across all the columns far exceeds what row storage would read. Search performance is therefore much worse.

With the multiplication segmentation mechanism, SPL can quickly locate any field value in column storage by record sequence number, so binary search becomes feasible. An index then only needs to record each record's sequence number, keeping its capacity small, comparable to a row-storage index. Even so, a binary search or index lookup must still read each field's data blocks separately, and performance cannot catch up with row storage. To pursue ultimate search performance, row storage is still required. In practice it is best to let programmers choose row or column storage according to the needs of the calculation, but some data warehouses offer only a transparent mechanism that denies users this choice, making the best result hard to achieve.
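
For intuition, here is what binary search by record sequence number looks like once every column can be addressed by record number; read_key below is a placeholder for "fetch the i-th value of the sort-key column", not an SPL API:

def column_binary_search(read_key, n_records, target):
    """Binary search an ordered table by record sequence number; read_key(i)
    stands for the cheap random access that record-number-based
    segmentation enables."""
    lo, hi = 0, n_records - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        v = read_key(mid)
        if v == target:
            return mid          # other fields can be fetched by this number
        elif v < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1                   # not found

# Example with an in-memory stand-in for the on-disk column:
keys = [10, 20, 30, 40, 50]
print(column_binary_search(keys.__getitem__, len(keys), 30))   # -> 2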

SPL leaves this freedom to developers, who decide whether to use column storage, and for which data, according to actual needs, in pursuit of extreme performance.

As introduced earlier, group tables use column storage by default. A row storage mode is also provided and can be specified with the @r option at creation time.

A2 in example code 1 can be changed to:

=file("T_r.ctx").create@r(#f1,#f2,#f3,f4,).append@i(A1)

This generates a row-storage group table. With both the column-storage and row-storage group tables available, programmers can freely choose whichever suits each task.

For scenarios that demand both high traversal performance and high search performance, the only way is to trade storage space for computation time: store the data redundantly in both forms, using column storage for traversal and row storage for search. However, this row-column coexistence scheme keeps two copies of the data and still needs an index on the row storage, so its overall hard disk footprint is fairly large.

SPL also provides the valued index, which copies the values of other fields into the index as it is built. The original group table remains column-stored for traversal, while the index itself already holds the field values, stored row-wise. A search can generally be satisfied without touching the original table at all, which gives better performance. Like row-column coexistence, the valued index serves traversal and search at the same time; and since it is effectively row storage and index combined, it occupies less space than the coexistence scheme.

Example code 2: valued index

A
1 =file("T.ctx").open()
2 =A1.index(IDS;f1;f4,amt1,amt2)
3 =A1.icursor(f1,f4;f1==123456).fetch()
4 =A1.icursor(f4,amt2;f1>=123456 && f2<=654321)

A2: When creating the index IDS, list the fields to be copied, f4, amt1, and amt2, in the parameters so that their values are stored in the index. Later, as long as a search involves only these fields, there is no need to read the original table.

Summary

When a table has many columns but a calculation involves only a few, column storage reads just the required columns, reducing hard disk access and improving performance. But that alone is not enough: a column-stored data warehouse also needs careful work on data compression, multi-threaded parallelism, and search calculations before column storage delivers its full effect.

The open-source data computing engine SPL makes full use of the characteristics of ordered data: while keeping CPU consumption low, it implements a compression scheme with a high compression ratio, greatly reducing physical storage and further improving performance. SPL also provides the multiplication segmentation mechanism, which solves the column-storage segmentation problem so that column data can fully exploit parallel computing. In addition, SPL can freely create row-storage and column-storage tables, letting developers choose between them, and it provides the valued index mechanism, achieving high performance for traversal and search at the same time.
