Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation

The main purpose of this paper is to introduce Data Blocks, a compressed storage format designed to solve the problems of hybrid (OLTP + OLAP) databases.

The difficulty of a hybrid system is that the optimization ideas for AP and TP are contradictory in many ways.

Take compression: for AP it improves query performance because of lower bandwidth usage, but for TP it reduces query performance because queries need to spend time decompressing, and it also affects indexing.

So most hybrid-strategy systems provide two parts, one read-optimized and one write-optimized.

But this is obviously not very elegant, and the merge between the two parts is a very heavy operation.

So what is proposed here is this:

the table is cut into fixed-size chunks, which are frozen into lightweight-compressed, immutable Data Blocks.

And in order to improve query speed, a lightweight PSMA index is introduced.

Finally, consider how vectorization and JIT compilation can be used to improve hybrid query processing.

As described above, the difference between vectorization and JIT is this: both reduce the number of CPU instructions per tuple, but a JIT-compiled pipeline passes tuples between operators in registers, while a vectorized engine passes whole vectors through main memory.

Vectorization has little effect on TP, because TP queries tend not to scan data; when only a few tuples are touched, vectorization brings no benefit.
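To make the contrast concrete, here is a minimal sketch of the same query, SELECT SUM(a) FROM t WHERE a > 10, in both styles (my own illustrative code, not from the paper):

```cpp
#include <cstdint>
#include <vector>

// JIT style: one fused loop per query; the current tuple stays in
// registers between the filter and the aggregation.
int64_t compiledPipeline(const std::vector<int32_t>& a) {
    int64_t sum = 0;
    for (int32_t v : a)           // SELECT SUM(a) WHERE a > 10
        if (v > 10) sum += v;     // no intermediate materialization
    return sum;
}

// Vectorized style: generic precompiled primitives; the selection
// vector is materialized in main memory between the two primitives.
std::vector<uint32_t> selGreater(const std::vector<int32_t>& a, int32_t c) {
    std::vector<uint32_t> sel;
    for (uint32_t i = 0; i < a.size(); ++i)
        if (a[i] > c) sel.push_back(i);
    return sel;
}

int64_t sumSelected(const std::vector<int32_t>& a,
                    const std::vector<uint32_t>& sel) {
    int64_t sum = 0;
    for (uint32_t i : sel) sum += a[i];
    return sum;
}
```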


Therefore, what this paper does is fuse the vectorization and JIT approaches:

a vectorized scan, run by an interpreter, feeds its data into the tuple-at-a-time execution pipeline produced by the JIT compiler.

Data Blocks

HyPer is an in-memory database, and memory is limited, so memory can be saved through compression.

But compression can hurt both TP and AP performance; Data Blocks are proposed here so that, even with compression, TP and AP performance does not degrade.

First, hot data is not compressed and carries no SMA index, which ensures that TP write performance is not affected.

When the data becomes cold, its chunk is frozen into a Data Block; the question is then how to retrieve quickly in spite of the compression.

We can see that Data Blocks mainly have the following characteristics:

1. The compression methods are chosen per block to ensure a good compression ratio.

2. Only byte-addressable compression methods are supported; that is the only way to jump to a record inside a block and retrieve it quickly.

3. SARGable scans are supported: a simple filter predicate can be matched directly on the compressed data, with no decompression (see the sketch after this list).

4. The indexes comprise SMAs and PSMAs.

Note that the SMAs and PSMAs are used only for cold data, to avoid affecting TP writes.
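As an illustration of point 3, a minimal sketch of a SARGable equality scan (hypothetical layout, assuming the column is compressed by byte-wise truncation against the block minimum): the predicate constant is rewritten into the compressed domain once, so no element is ever decompressed.

```cpp
#include <cstdint>
#include <vector>

// Column compressed by byte-wise truncation: stored[i] = value[i] - min,
// where every delta in this block fits into one byte.
struct TruncatedColumn {
    int32_t min;
    std::vector<uint8_t> stored;
};

// Evaluate "value == c" directly on the compressed data: rewrite the
// constant once instead of decompressing every element.
std::vector<uint32_t> scanEq(const TruncatedColumn& col, int32_t c) {
    std::vector<uint32_t> matches;
    int64_t delta = int64_t(c) - col.min;
    if (delta < 0 || delta > 255) return matches;  // cannot occur in block
    uint8_t key = uint8_t(delta);
    for (uint32_t i = 0; i < col.stored.size(); ++i)
        if (col.stored[i] == key) matches.push_back(i);
    return matches;
}
```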


The layout of a Data Block (shown in a figure in the paper) is as follows.

First comes the tuple count,

then the offsets of each column's attribute data,

followed by the real data.

Storage is columnar, so the data goes column by column rather than row by row.

Each column contains its SMA, its PSMA lookup table, a dictionary, the compressed data, and a string section,

and one column is stored after the next.
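Roughly, as a struct (the field names are mine; the paper uses a flat byte layout addressed via the offsets, not C++ objects):

```cpp
#include <cstdint>
#include <vector>

// Per-column part of a Data Block (conceptual view only).
struct AttributeChunk {
    uint8_t  compression;      // scheme chosen for this column
    int64_t  smaMin, smaMax;   // SMA: min/max of this column in the block
    uint32_t psmaOffset;       // offset of the PSMA lookup table
    uint32_t dictOffset;       // offset of the dictionary, if any
    uint32_t dataOffset;       // offset of the compressed data vector
    uint32_t stringOffset;     // offset of the string section, if any
};

struct DataBlock {
    uint32_t tupleCount;                  // number of tuples in the block
    std::vector<AttributeChunk> columns;  // one per column, in order
};
```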


Positional SMAs

An SMA (Small Materialized Aggregate) simply records the maximum and minimum of each attribute in a block.

An SMA's max and min can prune a block against a predicate, but if the predicate value falls between min and max, then unless some other index helps, the only option is to scan the whole block.

So the PSMA is incorporated here: a lookup table whose entries record the position range that actually needs to be scanned.

The lookup table and its algorithm are shown in figures in the paper.

Because small values need relatively high accuracy, the index is built on the delta, i.e. the value minus the block minimum.

The principle is to index on the highest non-zero byte of that delta: the slot is that byte's value plus r * 256, where r is the byte's position, so all deltas that share the same leading non-zero byte (and differ only in lower-order bytes) land in the same entry.

So for a 4-byte type the lookup table needs 4 * 2^8 = 1024 entries, since every possible value of every byte position gets one entry.

And if the raw values were all large, they would be concentrated in a narrow band of entries; using the delta spreads them across the table.
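A minimal sketch of the slot computation and table construction described above (my own code; it assumes 4-byte unsigned deltas and a caller-zero-initialized table):

```cpp
#include <cstdint>

// Map a delta (value - blockMin) to its PSMA table slot:
// slot = (most significant non-zero byte of delta) + r * 256,
// where r is that byte's position (0 = least significant byte).
// A 4-byte type therefore needs 4 * 256 = 1024 slots.
uint32_t psmaSlot(uint32_t delta) {
    uint32_t r = 0;                               // position of the MSB
    for (uint32_t d = delta >> 8; d != 0; d >>= 8) ++r;
    uint32_t msb = (delta >> (r * 8)) & 0xFF;     // leading non-zero byte
    return msb + r * 256;
}

// Each slot holds the [start, end) position range covering all
// occurrences of values that map to it; building takes one pass.
struct Range { uint32_t start = 0, end = 0; };    // empty if start == end

void psmaBuild(const uint32_t* deltas, uint32_t n, Range table[1024]) {
    for (uint32_t i = 0; i < n; ++i) {
        Range& e = table[psmaSlot(deltas[i])];
        if (e.start == e.end) { e.start = i; e.end = i + 1; }  // first hit
        else                  { e.end = i + 1; }               // extend
    }
}
```

A probe for value = c then computes psmaSlot(c - min) and scans only the [start, end) range stored in that slot.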


So a PSMA-based index is smaller than a conventional tree index, because it is not an exact index; it only narrows the scan to a range.

But if occurrences of the same value are scattered randomly across the block, the PSMA is not very effective; the best case is when equal values are clustered together.

Attribute Compression

This part mainly covers which compression methods can be made byte-addressable; the paper sticks to lightweight schemes such as single-value compression, byte-wise truncation, and ordered dictionary compression.
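As a sketch of one such scheme (my own simplified version, assuming at most 2^16 distinct values), ordered dictionary compression stays byte-addressable because every code is a fixed-width integer that can be read at position i directly, and order-preserving codes keep range predicates SARGable:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Ordered dictionary compression: codes preserve the sort order of
// the values, so range predicates can be evaluated on the codes.
struct DictColumn {
    std::vector<std::string> dict;   // sorted distinct values
    std::vector<uint16_t> codes;     // fixed-width, byte-addressable
};

DictColumn dictCompress(const std::vector<std::string>& values) {
    DictColumn col;
    col.dict = values;
    std::sort(col.dict.begin(), col.dict.end());
    col.dict.erase(std::unique(col.dict.begin(), col.dict.end()),
                   col.dict.end());
    for (const auto& v : values) {
        auto it = std::lower_bound(col.dict.begin(), col.dict.end(), v);
        col.codes.push_back(uint16_t(it - col.dict.begin()));
    }
    return col;
}

// Byte-addressable: random access to tuple i needs no decompression
// of its neighbors.
const std::string& valueAt(const DictColumn& col, uint32_t i) {
    return col.dict[col.codes[i]];
}
```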


VECTORIZED SCANS IN COMPILING QUERY ENGINES

Choosing the compression algorithm at the block and column level improves the compression ratio for different data types and distributions, but it also leads to many different physical representations.

This poses a great challenge for JIT compilation, because the generated code has to be compatible with every storage layout; with that many branches, the size of the compiled code explodes.

So what is proposed here is:

separate the Scan from the processing Pipeline.

The scan part is interpreted, while the pipeline part is JIT-compiled; the data obtained by the interpreted scan is fed, one vector at a time, into the compiled pipeline.
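A minimal sketch of that interface (names and types are mine, not the paper's actual code): the scan side is a small set of precompiled primitives, one per physical representation, picked per block at runtime, while the JIT side only ever sees plain vectors, so the compiled code stays independent of the storage layouts:

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint32_t kVectorSize = 1024;   // tuples handed over per call

// One precompiled scan primitive per physical representation; the
// engine picks the right one per block at runtime (interpretation).
using ScanFn = uint32_t (*)(const void* block, uint32_t pos,
                            uint32_t tupleCount, int32_t* out);

// Primitive for uncompressed int32 blocks (others would decompress
// truncated / dictionary data into the same plain output vector).
uint32_t scanUncompressed(const void* block, uint32_t pos,
                          uint32_t tupleCount, int32_t* out) {
    const int32_t* data = static_cast<const int32_t*>(block);
    uint32_t n = std::min(kVectorSize, tupleCount - pos);
    for (uint32_t i = 0; i < n; ++i) out[i] = data[pos + i];
    return n;
}

// The JIT-compiled pipeline only ever consumes plain vectors, never
// the compressed layouts, so its code size stays bounded.
using ConsumeFn = void (*)(const int32_t* vec, uint32_t n);

void runScan(const void* block, uint32_t tupleCount,
             ScanFn scan, ConsumeFn compiledConsume) {
    int32_t vec[kVectorSize];
    for (uint32_t pos = 0; pos < tupleCount; ) {
        uint32_t n = scan(block, pos, tupleCount, vec);  // interpreted
        compiledConsume(vec, n);                         // compiled
        pos += n;
    }
}
```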

