The main purpose of this paper is to introduce Data Blocks, a compressed storage format aimed at the hybrid (OLTP + OLAP) database problem.
The difficulty of hybrid systems is that AP and TP pull optimization in opposite directions in many ways.
Take compression: for AP it improves query performance because of lower bandwidth usage, but for TP it reduces performance because queries must decompress the data, and it also affects indexing.
So most hybrid systems provide two parts, one read-optimized and one write-optimized.
But this is obviously not very elegant, and the merge between the two parts is a very heavy operation.
So what is proposed here is:
cut the table into fixed-size chunks that are frozen into lightweight, compressed, immutable Data Blocks,
and, to keep query speed up, introduce a lightweight PSMA index.
Finally, the paper looks at how vectorization and JIT compilation can be combined to speed up hybrid queries.
The difference between the two: both reduce the number of CPU instructions compared to plain tuple-at-a-time processing, but JIT-compiled code passes data between operators in registers, while vectorization passes it through main memory.
And vectorization has little effect on TP, because TP queries tend not to scan much data; when only a few tuples are touched, vectorization brings no benefit.
Therefore this paper fuses vectorization and JIT:
a vectorized, interpreted sub-scan feeds its result tuples into the JIT-compiled tuple-at-a-time execution pipeline.
DataBlocks
HyPer is an in-memory database; memory is limited, so compression saves memory.
But compression can hurt both TP and AP performance. Data Blocks are proposed so that, even with compression, neither TP nor AP performance degrades.
First, hot data is not compressed and carries no SMA index, so TP write performance is not affected.
When data becomes cold, its chunk is frozen into a Data Block; the question then is how to retrieve quickly despite the compression.
We can see that Data Blocks mainly have the following characteristics:
1. Compression methods are chosen per block and column, to ensure a better compression ratio
2. Only byte-addressable compression methods are used; this is the only way to jump to a record inside a block and retrieve it quickly
3. SARGable scans are supported: when the condition is a simple filter, it can be matched directly on the compressed data, with no decompression
4. Each block carries SMA and PSMA indexes
Note that SMAs and PSMAs here are used only on cold data, to avoid affecting TP writes.
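Point 3 can be illustrated with a tiny sketch of a SARGable scan over an order-preserving dictionary-compressed column: the predicate is translated into dictionary codes once, and matching then runs directly on the compressed codes, never touching the decompressed values. (This is only an illustration of the idea; the function names and layout are made up, not HyPer's actual structures.)

```python
import bisect

def compress(values):
    """Order-preserving dictionary compression of a column."""
    dictionary = sorted(set(values))
    codes = [bisect.bisect_left(dictionary, v) for v in values]
    return dictionary, codes

def sargable_scan_less_than(dictionary, codes, bound):
    """Return positions where value < bound, comparing codes only."""
    # Because the dictionary is ordered, value < bound  <=>  code < code_bound,
    # so no code ever needs to be decompressed back to its value.
    code_bound = bisect.bisect_left(dictionary, bound)
    return [pos for pos, c in enumerate(codes) if c < code_bound]

column = ["pear", "apple", "fig", "apple", "kiwi"]
dictionary, codes = compress(column)
print(sargable_scan_less_than(dictionary, codes, "kiwi"))  # -> [1, 2, 3]
```

The scan's inner loop is a plain integer comparison, which is also what makes it amenable to SIMD.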
The layout of a Data Block, as shown in the paper's figure:
first the tuple count,
then the offsets to each column's attribute data,
followed by the real data.
Storage is columnar, so columns are laid out one after another, not row by row.
Each column contains: SMA, PSMA index, dictionary, compressed data, string data.
One column is stored in full, then the next.
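The header-plus-offsets layout above can be sketched as follows. This is a deliberately simplified guess at the structure (fixed 4-byte fields, one opaque payload per column standing in for SMA + PSMA + dictionary + data), not the paper's exact binary format.

```python
import struct

def pack_block(columns):
    """Pack a toy Data Block: uint32 tuple count, one uint32 offset per
    column, then the column payloads back to back.
    columns: list of bytes objects, one byte per tuple in this toy."""
    tuple_count = 0 if not columns else len(columns[0])
    header_size = 4 + 4 * len(columns)       # count + one offset per column
    offsets, payload, pos = [], b"", header_size
    for col in columns:
        offsets.append(pos)                  # absolute offset of this column
        payload += col
        pos += len(col)
    return struct.pack(f"<I{len(columns)}I", tuple_count, *offsets) + payload

block = pack_block([b"\x01\x02\x03", b"\x0a\x0b\x0c"])
# 12-byte header, column 0 at offset 12, column 1 at offset 15
```

The offsets are what make the block self-describing: a scan can jump straight to any column without parsing the others.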
Positional SMAs
SMA stands for Small Materialized Aggregates; in essence, the maximum and minimum of each block are recorded.
With max and min recorded, a predicate can be pruned, but if the predicate value falls between min and max, then without another index to help, the only option is to scan the block.
So PSMAs are introduced here, using a lookup table that records a scan range per value.
The lookup table works as follows,
as in the algorithm in the paper's figure.
Because small values need relatively high accuracy, the index is built on the delta from the block minimum.
The principle is to index by the most significant non-zero byte of the delta: the table index is that byte's value plus r * 256, where r is the byte's position, so all deltas sharing the same leading byte at the same position map to the same table entry.
So for 4-byte values the lookup table needs 4 * 2^8 entries, since each byte position needs one entry per possible byte value.
If raw values were used and they were all relatively large, they would all be concentrated in a few high entries; that is why the delta is used.
So a PSMA is smaller than a conventional tree index, because it is not an exact index; it only gives a range.
But if tuples with the same value are scattered randomly across the block, the PSMA range is wide and not very effective; the best case is when equal values are clustered.
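The build-and-probe scheme described above can be sketched for 4-byte values like this. It follows the leading-non-zero-byte mapping from the notes, but it is a simplified illustration, not the paper's exact code.

```python
# One entry per (byte position, byte value) pair: 4 * 256 for 4-byte values.
TABLE_SIZE = 4 * 256

def slot(delta):
    """Entry index: value of the leading non-zero byte, plus 256 per byte of
    position -- so small deltas land in the fine-grained low entries."""
    r = 0
    while delta >= 256:
        delta >>= 8
        r += 1
    return delta + r * 256

def build_psma(values):
    """Each entry stores the [first, last) position range of matching values."""
    lo = min(values)
    table = [None] * TABLE_SIZE              # None = nothing maps here
    for pos, v in enumerate(values):
        i = slot(v - lo)
        first, _ = table[i] or (pos, pos)
        table[i] = (first, pos + 1)          # widen range to cover this pos
    return lo, table

def lookup(lo, table, v):
    """Candidate scan range for 'value == v'; may contain false positives,
    the scan re-checks the actual values inside the range."""
    return table[slot(v - lo)] or (0, 0)

lo, table = build_psma([100, 103, 900, 103, 101])
print(lookup(lo, table, 103))   # -> (1, 4): scan positions 1..3, not the block
```

Note how 900 (delta 800, leading byte in position 1) lands in a different entry than the small deltas, while the two occurrences of 103 share one entry whose range spans both positions; a stray value between them simply widens the range, which is why clustering of equal values matters.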
Attribute Compression
This section mainly discusses which compression methods can be made byte-addressable.
VECTORIZED SCANS IN COMPILING QUERY ENGINES
Compression algorithms are chosen at block and column level, because adapting to each column's data type and distribution improves the compression ratio; but this also leads to many different physical representations.
This poses a great challenge for JIT, because the generated code must be compatible with every storage layout; with that many branches, the compiled code explodes in size.
So what is proposed here is:
separate the Scan from the rest of the processing Pipeline.
The scan part is interpreted, while the pipeline part is JIT-compiled; the data obtained by the interpreted scan is fed as vectors into the compiled pipeline.
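The split can be sketched as follows: a generic, interpreted, vectorized scan handles whatever block layout it meets and emits vectors of matching tuples, while the query pipeline is a single tuple-at-a-time function standing in for the JIT-compiled code. (Illustrative structure only, assuming a toy dict-based block format; this is not HyPer's engine.)

```python
VECTOR_SIZE = 2  # tiny on purpose; real engines use vectors of ~1000 tuples

def vectorized_scan(blocks, predicate):
    """Interpreted scan: copes with any block layout, yields tuple vectors."""
    out = []
    for block in blocks:
        for tup in block.get("rows", []):   # layout-specific decoding goes here
            if predicate(tup):
                out.append(tup)
                if len(out) == VECTOR_SIZE:
                    yield out
                    out = []
    if out:                                  # flush the last partial vector
        yield out

def compiled_pipeline(tup, state):
    """Stand-in for the JIT-compiled tuple-at-a-time pipeline (here: SUM)."""
    state["sum"] += tup[1]

state = {"sum": 0}
blocks = [{"rows": [(1, 10), (2, 20)]}, {"rows": [(3, 30), (4, 40)]}]
for vector in vectorized_scan(blocks, lambda t: t[0] % 2 == 1):
    for tup in vector:                       # compiled code consumes the vector
        compiled_pipeline(tup, state)
print(state["sum"])  # -> 40
```

The key point is that only `vectorized_scan` has to know about the many physical layouts; the compiled pipeline sees one uniform tuple format, so its generated code stays small.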