overview:
ClickHouse is at its best when querying a single large, wide table; its performance on multi-table JOINs is comparatively poor.
ClickHouse (CK) execution mode
A distributed query runs in two stages. In the first stage, the Coordinator receives the query and sends the request to the corresponding worker nodes; in the second stage, the Coordinator gathers the results from each worker node, processes them, and returns the final result.
Source: Why is ClickHouse Join criticized by everyone? - Zhihu
optimization suggestion
Use IN instead of JOIN
A JOIN has to build an in-memory hash table that holds all the data of the right table and then match the rows of the left table against it. An IN query also builds a hash set from the data of the right table, but it only has to test each left-table row for membership; it does not need to produce joined rows or write right-table data back into the block.
SELECT event_date,
       count()
FROM tob_apps_all
WHERE app_id = 10000000
  AND event_date >= '2022-01-01'
  AND event_date <= '2022-08-02'
  AND hash_uid GLOBAL IN
  (
      SELECT hash_uid
      FROM users_unique_all
      WHERE (tea_app_id = 10000000)
        AND (last_active_date >= '2022-01-01')
  )
GROUP BY event_date
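For contrast, here is a rough sketch of how the same filter might look when written as a JOIN. It is an illustration rather than a tuned query: it reuses the table and column names from the query above, and the DISTINCT in the subquery is an assumption added so that duplicate hash_uid values on the right side cannot multiply the counts.

```sql
-- Sketch only: JOIN form of the IN query above.
-- DISTINCT is an added assumption; it keeps one row per hash_uid on the
-- right side so that the join cannot inflate count().
SELECT et.event_date,
       count()
FROM tob_apps_all AS et
GLOBAL ALL INNER JOIN
(
    SELECT DISTINCT hash_uid
    FROM users_unique_all
    WHERE (tea_app_id = 10000000)
      AND (last_active_date >= '2022-01-01')
) AS ut ON et.hash_uid = ut.hash_uid
WHERE et.app_id = 10000000
  AND et.event_date >= '2022-01-01'
  AND et.event_date <= '2022-08-02'
GROUP BY et.event_date
```

In this form every matching row of tob_apps_all goes through the join and produces a joined row, whereas the IN version only checks whether hash_uid is present in the set.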
Prefer local JOIN
Pre-partition the data with the same rules, an approach known as Colocate JOIN. If the tables that need to be joined are distributed across shards by the same rule, each shard can join its own data locally and the query does not need a distributed JOIN.
SELECT
    et.os_name,
    ut.device_id AS user_device_id
FROM tob_apps_all AS et
ANY LEFT JOIN
(
    SELECT
        device_id,
        hash_uid
    FROM users_unique_all
    WHERE (tea_app_id = 268411) AND (last_active_date >= '2022-08-06')
) AS ut ON et.hash_uid = ut.hash_uid
WHERE (tea_app_id = 268411)
  AND (event = 'app_launch')
  AND (event_date = '2022-08-06')
SETTINGS distributed_perfect_shard = 1
For example, the event table tob_apps_all and the user table users_unique_all are stored in shards according to the user ID, and the data of the two tables of the same user is on the same shard, so the JOIN of these two tables does not need a distributed JOIN.
Source: JOIN Optimization of ClickHouse Engine in Behavior Analysis Scenario - Jianshu
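As a minimal sketch of what "distributed according to the same rules" could look like in DDL, both Distributed tables below use the same sharding expression. The cluster name my_cluster, the database default, and the local table names tob_apps_local and users_unique_local are hypothetical; only the two _all table names and the hash_uid column come from the example above.

```sql
-- Hypothetical names: my_cluster, default, tob_apps_local, users_unique_local.
-- Both tables shard on the same expression, so rows with the same hash_uid
-- are routed to the same shard when written through these tables.
CREATE TABLE tob_apps_all AS tob_apps_local
ENGINE = Distributed(my_cluster, default, tob_apps_local, cityHash64(hash_uid));

CREATE TABLE users_unique_all AS users_unique_local
ENGINE = Distributed(my_cluster, default, users_unique_local, cityHash64(hash_uid));
```

The co-location only holds if data is inserted through the Distributed tables, or if the writer applies the same routing rule itself. The distributed_perfect_shard setting used in the query above may not be available in every ClickHouse build; in open-source ClickHouse, distributed_product_mode = 'local' is a related setting that rewrites a distributed table inside a JOIN or IN subquery to its local table on each shard.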
Engine level optimization
The Join table engine is designed specifically for JOIN queries; it is essentially a thin wrapper around the JOIN operation.
Explanation:
The Join table engine is most commonly used as the right-hand table of a JOIN query. Data in a Join table is first written to memory and then synchronized to a file on disk. This has two consequences:
1. Queries against a Join table are very fast, because the engine exists precisely to speed up join queries;
2. A Join table is not suitable for large tables with tens of millions of rows or more, because it would take up too much memory. It is better suited to small tables that are queried frequently, usually as the right-hand table of a JOIN statement.
Join(ANY|ALL, LEFT|INNER, k1[, k2, ...])
Engine parameters: ANY|ALL is the join strictness; LEFT|INNER is the join type. See JOIN Clause for more information.
These parameters are written without quotes and must match the JOIN in which the table will be used. k1, k2, ... are the key columns from the USING clause on which the join is performed.
A table with this engine cannot be used in a GLOBAL JOIN.
Similar to the Set engine, you can use INSERT to add data to the table. With ANY, rows with duplicate keys are ignored (only one row per key is used for joins). With ALL, all rows with duplicate keys are kept and used for joins. Older ClickHouse versions do not allow a direct SELECT from a Join table; the usual ways to read its data are to use it as the right-hand table of a JOIN statement or to call the joinGet function.
Similar to the Set engine, the Join engine stores data on disk.
Create a table based on the join engine
Data table
CREATE TABLE join_tb1 (
    id UInt8,
    name String,
    time DateTime
) ENGINE = Log;
Join table
CREATE TABLE id_join_tb1 (
    id UInt8,
    price UInt32,
    time DateTime
) ENGINE = Join(ANY, LEFT, id);
Insert test data
INSERT INTO TABLE join_tb1 VALUES
(1,'ClickHouse','2019-05-01 12:00:00'),
(2,'Spark', '2019-05-01 12:30:00'),
(3,'ElasticSearch','2019-05-01 13:00:00');
INSERT INTO TABLE id_join_tb1 VALUES
(1,100,'2019-05-01 11:55:00'),
(1,105,'2019-05-01 11:10:00'),
(2,90,'2019-05-01 12:01:00'),
(3,80,'2019-05-01 13:10:00'),
(5,70,'2019-05-01 14:00:00'),
(6,60,'2019-05-01 13:50:00');
Join query
-- For each row of join_tb1, look up the price column of the Join table id_join_tb1 by id.
SELECT id, name, joinGet('id_join_tb1', 'price', id) AS price
FROM join_tb1;
Look up a single key
SELECT joinGet('id_join_tb1', 'price', toUInt8(1));
Source: Join Table Engine | ClickHouse Docs
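Besides joinGet, a Join table can serve as the right-hand table of an ordinary JOIN, as described above. The following is a minimal sketch using the two tables created in this section; the strictness and type in the query (ANY LEFT) are chosen to match the engine definition Join(ANY, LEFT, id).

```sql
-- Sketch: use the Join table as the right-hand side of a JOIN.
-- ANY LEFT matches the engine parameters Join(ANY, LEFT, id).
SELECT id, name, price
FROM join_tb1
ANY LEFT JOIN id_join_tb1 USING (id);
```

Because the table was created with ANY, the two rows inserted for id = 1 contribute only one price to the result.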
Create a view based on the join engine
Create data table
drop table if exists user_order;
create table user_order
(
    user_id String,    -- user ID
    event_date String, -- payment date
    order_no String,   -- order number
    amount Int32       -- amount
) ENGINE = MergeTree()
ORDER BY (user_id, event_date);
create view
drop view if exists user_order_userid_j;
CREATE MATERIALIZED VIEW user_order_userid_j
ENGINE = Join(ANY, INNER, user_id)
POPULATE
AS
select user_id, event_date, order_no, amount
from user_order;
insert data
insert into user_order(user_id, event_date, order_no, amount)
values
('user1', '2022-01-01', 'B', 4),
('user1', '2022-01-01', 'C', 8),
('user1', '2022-01-01', 'A', 2),
('user2', '2022-01-02', 'E', 3),
('user2', '2022-01-02', 'D', 7),
('user1', '2022-01-02', 'X', 6),
('user1', '2022-01-02', 'Y', 9);
Query the view
select * from user_order_userid_j;
Source: Using the Join table engine in clickhouse - Technology - Zhang Ziyang's Blog