Multi-field, arbitrary-combination conditional queries (zero modeling): a millisecond real-time audience selection practice

To read the original text, please click: http://click.aliyun.com/m/22881/

Tags

PostgreSQL, array, GIN index, arbitrary field combination query, audience selection, ToB analytical business, modeling

Background

You may work at a ToB data analysis company. You may have designed a table (containing a user ID and some precomputed statistical attribute values), you may have collected user data, and you may need to provide reports to customers, including queries on any combination of attribute values with results returned quickly.

These requirements are very common at ToB data platform companies. The headache is that they cannot be modeled in advance: B-side requirements are unpredictable, and customers expect both arbitrary query combinations and real-time responses.

Your customers' data may run to billions or tens of billions of rows with hundreds of attributes, and users may need the result for any combination of those attributes.

If you want to respond quickly, is your first reaction to index the query conditions?

For example, given the SQL

where col1=? and col2=? and col3<>? or col4=?;

how are you going to respond in real time? Build an index on (col1,col2) and another index on col4, is that right?

But the user's next request may well change the conditions again:

where col3=1 or col100=?

Is it necessary to build indexes on col3 and col100 as well?

You will find that this approach cannot be optimized, because the index combinations needed to cover the possible queries may number in the thousands.
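To make the problem concrete, the per-query indexing approach can be sketched as follows. The table `user_attrs`, its columns, and the index names are hypothetical stand-ins for the kind of table described above, not taken from the original article.

```sql
-- Hypothetical table: one row per user, one column per attribute.
CREATE TABLE IF NOT EXISTS user_attrs (
    user_id bigint PRIMARY KEY,
    col1 int,
    col2 int,
    col3 int,
    col4 int
    -- in practice there may be hundreds of attribute columns
);

-- Indexes tailored to: where col1=? and col2=? and col3<>? or col4=?
CREATE INDEX IF NOT EXISTS idx_user_attrs_c1_c2 ON user_attrs (col1, col2);
CREATE INDEX IF NOT EXISTS idx_user_attrs_c4    ON user_attrs (col4);

-- The next query shape (where col3=1 or col100=?) calls for more indexes,
-- here one on col3 (col100 would need its own as well).
CREATE INDEX IF NOT EXISTS idx_user_attrs_c3 ON user_attrs (col3);
```

With 100 attribute columns there are already 4,950 distinct column pairs, before counting triples or column order, so tailoring one composite index to each query shape does not scale.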

PostgreSQL's secret weapon for arbitrary-field retrieval

I have previously written several practical articles on arbitrary-field queries. The technique is widely used for audience selection in advertising and marketing platforms, audience selection in ToB products, and arbitrary-combination filtering on front-end pages.

Method 1, GIN composite index

Create a GIN composite index on the fields that need to be queried.
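A minimal sketch of Method 1, assuming integer attribute columns and a hypothetical table `user_attrs` (names are illustrative). Plain scalar types have no built-in GIN operator class, so the sketch relies on the `btree_gin` contrib extension.

```sql
-- btree_gin supplies GIN operator classes for scalar types such as int and text.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical table: one row per user, one column per attribute.
CREATE TABLE IF NOT EXISTS user_attrs (
    user_id bigint PRIMARY KEY,
    col1 int,
    col2 int,
    col3 int,
    col4 int
);

-- One multicolumn GIN index over every attribute that may appear in a filter.
CREATE INDEX IF NOT EXISTS idx_user_attrs_gin
    ON user_attrs USING gin (col1, col2, col3, col4);

-- Queries on any subset of the indexed columns can use the same index.
SELECT user_id FROM user_attrs WHERE col1 = 1 AND col3 = 7;
SELECT user_id FROM user_attrs WHERE col2 = 5 OR col4 = 9;
```

Unlike a multicolumn B-Tree, a multicolumn GIN index is equally effective whichever of its columns the conditions use. Note that `btree_gin` covers equality and range operators only, so a condition such as `col3 <> ?` would be applied as a filter rather than through the index.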

The case study is as follows:

"Any Combination Field Equivalent Query, Explore PostgreSQL Multi-column Expanded B-Tree (GIN)"

This approach targets arbitrary-field matching. For multiple query conditions, PostgreSQL internally combines index scans with bitmapAnd or bitmapOr to filter heap blocks and obtain the intermediate result.

+---------------------------------------------+
|100000000001000000010000000000000111100000000| bitmap 1
|000001000001000100010000000001000010000000010| bitmap 2
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
|000000000001000000010000000000000010000000000| combined bitmap
+-----------+-------+--------------+----------+
            |       |              |
            v       v              v
Used to scan the heap only for matching pages:
+---------------------------------------------+
|___________X_______X______________X__________|
+---------------------------------------------+

Why is this method fast?

The reason is that the GIN index performs bitmapAnd and bitmapOr internally, which is roughly equivalent to building a separate B-Tree index on each field (PostgreSQL can also merge the results of multiple B-Tree indexes with bitmapAnd and bitmapOr).
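The two ideas in this paragraph can be tried out directly. The sketch below first reproduces the AND step on the two page bitmaps from the diagram using PostgreSQL bit strings (only an analogy: the executor uses its own in-memory bitmap structure, not the `bit` type), then builds one B-Tree index per column on a hypothetical table `user_attrs_bt` so the plan can be inspected with `EXPLAIN`.

```sql
-- Part 1: bitwise AND of the two page bitmaps shown in the diagram.
SELECT B'100000000001000000010000000000000111100000000'
     & B'000001000001000100010000000001000010000000010' AS bitmap_and;
-- bitmap_and: 000000000001000000010000000000000010000000000
-- (bits set at the three pages marked X in the diagram)

-- Part 2: one single-column B-Tree index per attribute (hypothetical names).
CREATE TABLE IF NOT EXISTS user_attrs_bt (
    user_id bigint PRIMARY KEY,
    col1 int,
    col2 int,
    col4 int
);
CREATE INDEX IF NOT EXISTS idx_user_attrs_bt_c1 ON user_attrs_bt (col1);
CREATE INDEX IF NOT EXISTS idx_user_attrs_bt_c2 ON user_attrs_bt (col2);
CREATE INDEX IF NOT EXISTS idx_user_attrs_bt_c4 ON user_attrs_bt (col4);

-- Inspect how the planner combines the indexes for a mixed AND/OR filter.
EXPLAIN
SELECT user_id
FROM user_attrs_bt
WHERE (col1 = 1 AND col2 = 2) OR col4 = 4;
```

On a table with enough rows and selective conditions, the plan may show BitmapAnd and BitmapOr nodes above several Bitmap Index Scans. On an empty or very small table the planner will usually prefer a sequential scan, so the plan shape depends on the data and statistics.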

The principle of bitmapAnd and bitmapOr is explained in:

"PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan"

The GIN composite index method can meet the above requirements, but when the amount of data is very large or there are many columns, the GIN index will be relatively large.

Method 1 Optimization Tips

It is recommended to split the data into multiple tables (for example by a random split, or by a condition that every query must specify). This reduces the size of each GIN index and also lets queries take advantage of PostgreSQL's multi-table parallel feature to improve performance.
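One possible way to realize the split is declarative hash partitioning, sketched below. This is an assumption on my part rather than the author's stated method: it requires PostgreSQL 11 or later (hash partitioning and indexes on partitioned tables), and the table name, partition count, and columns are hypothetical.

```sql
-- btree_gin is needed for GIN indexes on plain int columns.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical parent table, split "randomly" by hashing the user ID.
CREATE TABLE IF NOT EXISTS user_attrs_part (
    user_id bigint NOT NULL,
    col1 int,
    col2 int,
    col3 int,
    col4 int
) PARTITION BY HASH (user_id);

-- Four partitions; the count is illustrative.
CREATE TABLE IF NOT EXISTS user_attrs_part_0 PARTITION OF user_attrs_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE IF NOT EXISTS user_attrs_part_1 PARTITION OF user_attrs_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE IF NOT EXISTS user_attrs_part_2 PARTITION OF user_attrs_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE IF NOT EXISTS user_attrs_part_3 PARTITION OF user_attrs_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Creating the index on the parent creates a smaller GIN index on each partition.
CREATE INDEX IF NOT EXISTS idx_user_attrs_part_gin
    ON user_attrs_part USING gin (col1, col2, col3, col4);
```

If every query carries a mandatory condition (for example a customer or tenant ID), list or range partitioning on that column would let the planner skip irrelevant partitions entirely, instead of scanning all of them.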

PostgreSQL Parallel Computing Features

PostgreSQL supports single-table multi-core parallelism, as well as parallel query of multiple tables.

Single-table parallelism means that one SQL statement can use multiple CPU cores when processing data in a single table.

Multi-table parallelism means that when one SQL statement involves multiple tables (for example under an Append scan), the scans of those tables can be processed in parallel.
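A small sketch for experimenting with both kinds of parallelism. The setting values and the tables `user_attrs_a` and `user_attrs_b` are illustrative assumptions; `enable_parallel_append` exists in PostgreSQL 11 and later.

```sql
-- Session-level settings; the values are illustrative.
SET max_parallel_workers_per_gather = 4;
SET enable_parallel_append = on;

-- Two hypothetical tables holding slices of the same data.
CREATE TABLE IF NOT EXISTS user_attrs_a (user_id bigint, col1 int);
CREATE TABLE IF NOT EXISTS user_attrs_b (user_id bigint, col1 int);

-- UNION ALL over several tables is planned as an Append node.
EXPLAIN
SELECT count(*)
FROM (SELECT col1 FROM user_attrs_a
      UNION ALL
      SELECT col1 FROM user_attrs_b) AS u
WHERE col1 = 1;
```

Whether the plan actually contains Gather, Parallel Seq Scan, or Parallel Append nodes depends on table sizes and cost settings; on empty tables the planner will normally choose a serial plan.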

Multi-table parallelism (parallel Append scan) was previewed for PostgreSQL 10 in the article below; in released versions, Parallel Append is available from PostgreSQL 11:

"PostgreSQL 10.0 preview sharding enhancement - support for append node parallelism"