ClickHouse's join optimization

Overview

ClickHouse is at its best when querying a single large, wide table; its performance on multi-table JOINs is comparatively poor.

ClickHouse execution model

A distributed query runs in two stages. In the first stage, the coordinator receives the query and sends the request to the relevant worker nodes. In the second stage, the coordinator gathers the results from each worker node, processes them, and returns the final result.
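The two stages can be pictured with a small Python model. This is only a sketch of the scatter/gather idea, not ClickHouse code: the shard contents and the function names `worker_count_by_date` and `coordinator_query` are hypothetical.

```python
from collections import Counter

# Hypothetical data: each worker node holds one shard of (event_date, user) rows.
SHARDS = [
    [("2022-01-01", "u1"), ("2022-01-02", "u2")],
    [("2022-01-01", "u3")],
    [("2022-01-02", "u4"), ("2022-01-02", "u5")],
]

def worker_count_by_date(rows):
    """Stage 1: a worker computes a partial result over its own shard."""
    return Counter(date for date, _user in rows)

def coordinator_query(shards):
    """Stage 2: the coordinator merges the partial results and returns them."""
    total = Counter()
    for rows in shards:
        total.update(worker_count_by_date(rows))
    return dict(total)

result = coordinator_query(SHARDS)
print(result)
```

An aggregation like this merges cheaply because each partial result is small. A JOIN between two distributed tables does not split this cleanly, which is the motivation for the suggestions that follow.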

Source: Why is ClickHouse Join criticized by everyone? - Zhihu

Optimization suggestions

Use IN instead of JOIN

A JOIN builds an in-memory hash table that must hold all the data of the right table, then matches the rows of the left table against it. An IN query also builds a hash set from the right-hand data, but it only has to test whether each left-hand key is in the set; it does not need to fetch matching right-table rows or write their columns back to the block.

SELECT
    event_date,
    count()
FROM tob_apps_all
WHERE app_id = 10000000
    AND event_date >= '2022-01-01'
    AND event_date <= '2022-08-02'
    AND hash_uid GLOBAL IN
    (
        SELECT hash_uid
        FROM users_unique_all
        WHERE (tea_app_id = 10000000)
            AND (last_active_date >= '2022-01-01')
    )
GROUP BY event_date
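The difference between the two strategies can be modelled in a few lines of Python. This is a simplified sketch rather than ClickHouse's actual implementation; the helper names `hash_join` and `in_filter` and the sample rows are made up for illustration.

```python
def hash_join(left_rows, right_rows, key):
    """JOIN: build a hash table holding whole right-side rows, probe it with
    each left row, and write the matched columns into the output rows."""
    table = {}
    for row in right_rows:
        table.setdefault(row[key], row)  # keep the first row per key
    out = []
    for row in left_rows:
        match = table.get(row[key])
        if match is not None:
            out.append({**row, **match})  # right-side columns are copied out
    return out

def in_filter(left_rows, right_keys, key):
    """IN: build a hash set of keys only; left rows pass through unchanged."""
    keys = set(right_keys)
    return [row for row in left_rows if row[key] in keys]

# Hypothetical sample data.
events = [
    {"hash_uid": 1, "event_date": "2022-01-01"},
    {"hash_uid": 2, "event_date": "2022-01-01"},
    {"hash_uid": 3, "event_date": "2022-01-02"},
]
users = [
    {"hash_uid": 1, "device_id": "d1"},
    {"hash_uid": 3, "device_id": "d3"},
]

joined = hash_join(events, users, "hash_uid")
filtered = in_filter(events, [u["hash_uid"] for u in users], "hash_uid")
```

Both calls keep the same two events, but `hash_join` has to store every right-side row and copy `device_id` into each output row, while `in_filter` stores only the keys. When the query needs no columns from the right table, as in the count above, that extra work is wasted.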

Prefer local JOINs

Pre-partition the data by the same rule, i.e. a colocated JOIN. If the tables that need to be joined are distributed across shards according to the same rule in advance, queries do not need a distributed JOIN.

SELECT 
    et.os_name, 
    ut.device_id AS user_device_id
FROM tob_apps_all AS et 
ANY LEFT JOIN 
(
    SELECT 
        device_id, 
        hash_uid
    FROM users_unique_all 
    WHERE (tea_app_id = 268411) AND (last_active_date >= '2022-08-06')
) AS ut ON et.hash_uid = ut.hash_uid
WHERE (tea_app_id = 268411) 
AND (event = 'app_launch') 
AND (event_date = '2022-08-06')
settings distributed_perfect_shard=1

For example, the event table tob_apps_all and the user table users_unique_all are both sharded by user ID, so the rows of both tables for a given user are on the same shard and joining the two tables does not require a distributed JOIN.
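The colocation idea can be sketched in Python as follows. The shard count, the CRC32-based sharding rule, and the helper names are assumptions made for the illustration; the source does not say which hash function the real tables use, and the local join here is a plain inner join rather than the ANY LEFT JOIN of the query above.

```python
import zlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_of(user_id):
    """The same sharding rule for every table: a stable hash of the user ID."""
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

def distribute(rows, key):
    """Place each row on the shard chosen by its user ID."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for row in rows:
        shards[shard_of(row[key])].append(row)
    return shards

def local_join(event_shard, user_shard):
    """Join that looks only at the data stored on one shard."""
    users = {u["hash_uid"]: u for u in user_shard}
    return [
        {**e, "device_id": users[e["hash_uid"]]["device_id"]}
        for e in event_shard
        if e["hash_uid"] in users
    ]

# Hypothetical sample data: ten users, one event each.
events = [{"hash_uid": f"u{i}", "os_name": "android"} for i in range(10)]
users = [{"hash_uid": f"u{i}", "device_id": f"d{i}"} for i in range(10)]

event_shards = distribute(events, "hash_uid")
user_shards = distribute(users, "hash_uid")

# Each shard joins its own data; no rows are exchanged between shards.
joined = [
    row
    for es, us in zip(event_shards, user_shards)
    for row in local_join(es, us)
]
```

Because both tables use the same `shard_of` rule, every event finds its user on its own shard and the per-shard results only need to be concatenated.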

Source: JOIN Optimization of ClickHouse Engine in Behavior Analysis Scenario - Jianshu

Engine-level optimization

The Join table engine is purpose-built for JOIN queries; it is essentially a thin wrapper around them.

Explanation:
The most common use of a Join table is as the right-hand table of a JOIN query. Data in a Join table is first written to memory and then synchronized to a file on disk. This means two things:
1. Queries against a Join table are very fast, because the engine exists precisely to speed up JOIN queries;
2. A Join table is not suitable for large tables with tens of millions of rows or more, otherwise it will take up too much memory. It is better suited to small tables that are queried frequently, usually as the right-hand table of a JOIN statement.

Join(ANY|ALL, LEFT|INNER, k1[, k2, ...])

Engine parameters: ANY|ALL is the join strictness; LEFT|INNER is the join type. See JOIN Clause for more information.
These parameters are written without quotes and must match the JOIN that the table will be used in. k1, k2, ... are the key columns from the USING clause that the join is performed on.

This engine cannot be used for GLOBAL JOIN.

As with the Set engine, you can use INSERT to add data to the table. With ANY, rows with duplicate keys are ignored (only one row per key is used for joins). With ALL, every row with a duplicate key is kept and used for joins. A Join table is not meant to be queried like an ordinary table: the usual ways to retrieve its data are to use it as the right-hand table of a JOIN or to call the joinGet function.

Similar to the Set engine, the Join engine stores data on disk.
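The effect of the ANY and ALL settings on duplicate keys can be modelled with a small Python sketch. It is a toy model, not the engine's real data structure: `build_join_table` is a hypothetical helper, and keeping the first inserted row under ANY is an assumption of the model.

```python
def build_join_table(rows, strictness):
    """Toy model of Join(ANY|ALL, ...): map each key to its stored values."""
    table = {}
    for key, value in rows:
        if strictness == "ANY":
            # A key that is already present is left alone: duplicates are ignored.
            table.setdefault(key, [value])
        elif strictness == "ALL":
            # Every row is kept, so a key can map to several values.
            table.setdefault(key, []).append(value)
        else:
            raise ValueError(f"unknown strictness: {strictness}")
    return table

# (id, price) pairs, including a duplicate key.
rows = [(1, 100), (1, 105), (2, 90)]

any_table = build_join_table(rows, "ANY")
all_table = build_join_table(rows, "ALL")
```

In this model `any_table` holds a single price for id 1 while `all_table` holds both, so a join against the ALL table would produce two output rows for that id.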

Create a table based on the Join engine

Data table

CREATE TABLE join_tb1 (
    id UInt8,
    name String,
    time DateTime
) ENGINE = Log;

Join table

CREATE TABLE id_join_tb1 (
    id UInt8,
    price UInt32,
    time DateTime
) ENGINE = Join(ANY, LEFT, id);

Insert test data

INSERT INTO TABLE join_tb1 VALUES 
(1,'ClickHouse','2019-05-01 12:00:00'),   
(2,'Spark', '2019-05-01 12:30:00'), 
(3,'ElasticSearch','2019-05-01 13:00:00');

INSERT INTO TABLE id_join_tb1 VALUES 
(1,100,'2019-05-01 11:55:00'),
(1,105,'2019-05-01 11:10:00'),
(2,90,'2019-05-01 12:01:00'),
(3,80,'2019-05-01 13:10:00'),
(5,70,'2019-05-01 14:00:00'),
(6,60,'2019-05-01 13:50:00');

Join query

The following query uses the id column of the data table join_tb1 to look up the matching price column in the Join table id_join_tb1.

SELECT id, name, joinGet('id_join_tb1', 'price', id) AS price
FROM join_tb1;

Look up a single key

SELECT joinGet('id_join_tb1', 'price', toUInt8(1));

Source: Join Table Engine | ClickHouse Docs

Create a view based on the Join engine

Create the data table

DROP TABLE IF EXISTS user_order;
CREATE TABLE user_order
(
    user_id String,         -- user ID
    event_date String,      -- payment date
    order_no String,        -- order number
    amount Int32            -- amount
) ENGINE = MergeTree()
ORDER BY (user_id, event_date);

Create the view

DROP VIEW IF EXISTS user_order_userid_j;
CREATE MATERIALIZED VIEW user_order_userid_j
ENGINE = Join(ANY, INNER, user_id)
POPULATE
AS
SELECT user_id, event_date, order_no, amount
FROM user_order;

Insert data

INSERT INTO user_order (user_id, event_date, order_no, amount)
VALUES
('user1', '2022-01-01', 'B', 4),
('user1', '2022-01-01', 'C', 8),
('user1', '2022-01-01', 'A', 2),
('user2', '2022-01-02', 'E', 3),
('user2', '2022-01-02', 'D', 7),
('user1', '2022-01-02', 'X', 6),
('user1', '2022-01-02', 'Y', 9);

Query the view

SELECT * FROM user_order_userid_j;

Sources:

Using the Join table engine in clickhouse - Technology - Zhang Ziyang's Blog

ClickHouse Study Notes (2): Introduction to Common Table Engines of ClickHouse_clickhouse url format_leo825...'s Blog-CSDN Blog

Origin blog.csdn.net/csdncjh/article/details/131029620