Original source: https://bohutang.me/2020/08/31/clickhouse-and-friends-materialized-view/
Last Update: 2020-08-31
In ClickHouse, the materialized view (Materialized View) is a magical and powerful feature with a unique role.
This article analyzes the underlying mechanism of ClickHouse materialized views, so that you can use them better.
What is a materialized view
For most people, the concept of a materialized view is rather abstract. Materialized? View? . . .
To understand it better, let's start with a scenario.
Suppose you are a "happy" programmer at *hub, and one day the product manager brings a requirement: real-time statistics of video downloads per hour.
The user download table:
clickhouse> SELECT * FROM download LIMIT 10;
+---------------------+--------+--------+
| when | userid | bytes |
+---------------------+--------+--------+
| 2020-08-31 18:22:06 | 19 | 530314 |
| 2020-08-31 18:22:06 | 19 | 872957 |
| 2020-08-31 18:22:06 | 19 | 107047 |
| 2020-08-31 18:22:07 | 19 | 214876 |
| 2020-08-31 18:22:07 | 19 | 820943 |
| 2020-08-31 18:22:07 | 19 | 693959 |
| 2020-08-31 18:22:08 | 19 | 882151 |
| 2020-08-31 18:22:08 | 19 | 644223 |
| 2020-08-31 18:22:08 | 19 | 199800 |
| 2020-08-31 18:22:09 | 19 | 511439 |
... ....
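The original post does not show the definition of the download source table. For completeness, a minimal sketch consistent with the columns above might look like the following; the engine choice and sort key are assumptions, not from the original:

```sql
-- Hypothetical definition of the source table (not shown in the original post).
CREATE TABLE download (
    when   DateTime,
    userid UInt32,
    bytes  UInt64
)
ENGINE = MergeTree
ORDER BY (userid, when);
```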
Calculate downloads per hour:
clickhouse> SELECT toStartOfHour(when) AS hour, userid, count() as downloads, sum(bytes) AS bytes FROM download GROUP BY userid, hour ORDER BY userid, hour;
+---------------------+--------+-----------+------------+
| hour | userid | downloads | bytes |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 | 19 | 6822 | 3378623036 |
| 2020-08-31 19:00:00 | 19 | 10800 | 5424173178 |
| 2020-08-31 20:00:00 | 19 | 10800 | 5418656068 |
| 2020-08-31 21:00:00 | 19 | 10800 | 5404309443 |
| 2020-08-31 22:00:00 | 19 | 10800 | 5354077456 |
| 2020-08-31 23:00:00 | 19 | 10800 | 5390852563 |
| 2020-09-01 00:00:00 | 19 | 10800 | 5369839540 |
| 2020-09-01 01:00:00 | 19 | 10800 | 5384161012 |
| 2020-09-01 02:00:00 | 19 | 10800 | 5404581759 |
| 2020-09-01 03:00:00 | 19 | 6778 | 3399557322 |
+---------------------+--------+-----------+------------+
10 rows in set (0.13 sec)
This is easy, but there is a problem: every query has to recompute over the full download table, and with *hub's data volume that is unbearable.
Here is an idea: pre-aggregate the data in download, save the result to a new table download_hour_mv, and update download_hour_mv in real time as increments arrive in download; then each query only needs to read download_hour_mv.
This new table can be regarded as a materialized view, which is a normal table in ClickHouse.
Create a materialized view
clickhouse> CREATE MATERIALIZED VIEW download_hour_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour) ORDER BY (userid, hour)
AS SELECT
toStartOfHour(when) AS hour,
userid,
count() as downloads,
sum(bytes) AS bytes
FROM download WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour
This statement mainly does the following:

- Creates a materialized view download_hour_mv with the SummingMergeTree engine.
- The view's data comes from the download table, "materialized" according to the expressions in the SELECT statement.
- Picks a future time (the current time is 2020-08-31 18:00:00) as the starting point: WHERE when >= toDateTime('2020-09-01 04:00:00') means that only data written from 2020-09-01 04:00:00 onward will be synchronized to download_hour_mv.

At the moment, download_hour_mv is therefore an empty table:
clickhouse> SELECT * FROM download_hour_mv ORDER BY userid, hour;
Empty set (0.02 sec)
Note: the POPULATE keyword (https://clickhouse.tech/docs/en/sql-reference/statements/create/view/#materialized) is officially available, but it is not recommended, because data written to download while the view is being created would be lost. That is why we add a WHERE condition as a data synchronization point.
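As an aside, ClickHouse also lets you create a materialized view that writes into an explicitly named target table via the TO clause; POPULATE cannot be used with TO either, so the same WHERE-based synchronization point trick applies. A sketch of this variant (the names download_hour_dst and download_hour_mv2 are hypothetical):

```sql
-- Hypothetical variant: store the view's data in an explicit target table.
CREATE TABLE download_hour_dst (
    hour      DateTime,
    userid    UInt32,
    downloads UInt64,
    bytes     UInt64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (userid, hour);

CREATE MATERIALIZED VIEW download_hour_mv2 TO download_hour_dst
AS SELECT
    toStartOfHour(when) AS hour,
    userid,
    count() AS downloads,
    sum(bytes) AS bytes
FROM download
WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour;
```

With TO, the target table is visible and managed by you, which makes schema changes and backfills more straightforward than relying on the view's implicit storage.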
So, how do we consistently synchronize the source table's existing data into download_hour_mv?
Materializing the full data
For the historical data before 2020-09-01 04:00:00, we can materialize a snapshot of the download table with an INSERT INTO ... SELECT plus the complementary WHERE condition:
clickhouse> INSERT INTO download_hour_mv
SELECT
toStartOfHour(when) AS hour,
userid,
count() as downloads,
sum(bytes) AS bytes
FROM download WHERE when < toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour
Query materialized views:
clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads DESC;
+---------------------+--------+-----------+------------+
| hour | userid | downloads | bytes |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 | 19 | 6822 | 3378623036 |
| 2020-08-31 19:00:00 | 19 | 10800 | 5424173178 |
| 2020-08-31 20:00:00 | 19 | 10800 | 5418656068 |
| 2020-08-31 21:00:00 | 19 | 10800 | 5404309443 |
| 2020-08-31 22:00:00 | 19 | 10800 | 5354077456 |
| 2020-08-31 23:00:00 | 19 | 10800 | 5390852563 |
| 2020-09-01 00:00:00 | 19 | 10800 | 5369839540 |
| 2020-09-01 01:00:00 | 19 | 10800 | 5384161012 |
| 2020-09-01 02:00:00 | 19 | 10800 | 5404581759 |
| 2020-09-01 03:00:00 | 19 | 6778 | 3399557322 |
+---------------------+--------+-----------+------------+
10 rows in set (0.05 sec)
You can see that the historical data has been "materialized" into download_hour_mv.
Materializing the incremental data
Write some new data to the download table:
clickhouse> INSERT INTO download
SELECT
toDateTime('2020-09-01 04:00:00') + number*(1/3) as when,
19,
rand() % 1000000
FROM system.numbers
LIMIT 10;
Query the materialized view download_hour_mv:
clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads;
+---------------------+--------+-----------+------------+
| hour | userid | downloads | bytes |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 | 19 | 6822 | 3378623036 |
| 2020-08-31 19:00:00 | 19 | 10800 | 5424173178 |
| 2020-08-31 20:00:00 | 19 | 10800 | 5418656068 |
| 2020-08-31 21:00:00 | 19 | 10800 | 5404309443 |
| 2020-08-31 22:00:00 | 19 | 10800 | 5354077456 |
| 2020-08-31 23:00:00 | 19 | 10800 | 5390852563 |
| 2020-09-01 00:00:00 | 19 | 10800 | 5369839540 |
| 2020-09-01 01:00:00 | 19 | 10800 | 5384161012 |
| 2020-09-01 02:00:00 | 19 | 10800 | 5404581759 |
| 2020-09-01 03:00:00 | 19 | 6778 | 3399557322 |
| 2020-09-01 04:00:00 | 19 | 10 | 5732600 |
+---------------------+--------+-----------+------------+
11 rows in set (0.00 sec)
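One caveat worth noting (not from the original post): SummingMergeTree collapses rows with the same sort key only during asynchronous background merges, so after several separate inserts into the same hour the view may temporarily contain multiple partial rows per (userid, hour). To always get correct totals, aggregate again at query time:

```sql
-- Re-aggregate at query time, because background merges are asynchronous.
SELECT
    hour,
    userid,
    sum(downloads) AS downloads,
    sum(bytes) AS bytes
FROM download_hour_mv
GROUP BY userid, hour
ORDER BY hour, userid;
```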
The last row shows that the newly inserted increment has been aggregated and synchronized into the view in real time. How is this done?
Materialized View Principle
The principle behind ClickHouse materialized views is not complicated: when new data is written to the download table, ClickHouse detects that a materialized view is associated with it and performs the materialized operation on the newly written block of data.
For example, the new data above is generated by the following SQL:
clickhouse> SELECT
-> toDateTime('2020-09-01 04:00:00') + number*(1/3) as when,
-> 19,
-> rand() % 1000000
-> FROM system.numbers
-> LIMIT 10;
+---------------------+------+-------------------------+
| when | 19 | modulo(rand(), 1000000) |
+---------------------+------+-------------------------+
| 2020-09-01 04:00:00 | 19 | 870495 |
| 2020-09-01 04:00:00 | 19 | 322270 |
| 2020-09-01 04:00:00 | 19 | 983422 |
| 2020-09-01 04:00:01 | 19 | 759708 |
| 2020-09-01 04:00:01 | 19 | 975636 |
| 2020-09-01 04:00:01 | 19 | 365507 |
| 2020-09-01 04:00:02 | 19 | 865569 |
| 2020-09-01 04:00:02 | 19 | 975742 |
| 2020-09-01 04:00:02 | 19 | 85827 |
| 2020-09-01 04:00:03 | 19 | 992779 |
+---------------------+------+-------------------------+
10 rows in set (0.02 sec)
The statement executed by the materialized view is similar to:
INSERT INTO download_hour_mv
SELECT
toStartOfHour(when) AS hour,
userid,
count() as downloads,
sum(bytes) AS bytes
FROM [the 10 newly inserted rows] WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour
Code navigation:
Add view OutputStream, InterpreterInsertQuery.cpp (https://github.com/ClickHouse/ClickHouse/blob/cb4644ea6d04b3d5900868b4f8d686a03082379a/src/Interpreters/InterpreterInsertQuery.cpp#L313)
if (table->noPushingToViews() && !no_destination)
    out = table->write(query_ptr, metadata_snapshot, context);
else
    out = std::make_shared<PushingToViewsBlockOutputStream>(
        table, metadata_snapshot, context, query_ptr, no_destination);
Construct Insert, PushingToViewsBlockOutputStream.cpp (https://github.com/ClickHouse/ClickHouse/blob/cb4644ea6d04b3d5900868b4f8d686a03082379a/src/DataStreams/PushingToViewsBlockOutputStream.cpp#L85)
ASTPtr insert_query_ptr(insert.release());
InterpreterInsertQuery interpreter(insert_query_ptr, *insert_context);
BlockIO io = interpreter.execute();
out = io.out;
Materialized new data: PushingToViewsBlockOutputStream.cpp (https://github.com/ClickHouse/ClickHouse/blob/cb4644ea6d04b3d5900868b4f8d686a03082379a/src/DataStreams/PushingToViewsBlockOutputStream.cpp#L331)
Context local_context = *select_context;
local_context.addViewSource(
    StorageValues::create(
        storage->getStorageID(), metadata_snapshot->getColumns(), block, storage->getVirtuals()));
select.emplace(view.query, local_context, SelectQueryOptions());
in = std::make_shared<MaterializingBlockInputStream>(select->execute().getInputStream());
Summary
Materialized views have many uses.
For example, to work around table index limitations, we can use a materialized view to maintain the same data in a different physical sort order, so that queries with different conditions can be served efficiently.
Also, through the real-time data synchronization capability of materialized views, we can change table structures more flexibly.
Even more powerful, combined with the MergeTree family of engines (SummingMergeTree, AggregatingMergeTree, etc.), a materialized view provides real-time pre-aggregation for fast queries.
The principle is to process each block of newly written data with the AS SELECT ... expression and write the result into the materialized view's table, which is an ordinary table that can be read and written directly.
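To see the "ordinary table" behind a view created without a TO clause, you can look it up in the system tables; ClickHouse stores the view's data in an implicit table whose name starts with .inner (the exact naming depends on the database engine and server version, so treat this as a sketch):

```sql
-- Inspect the storage behind the materialized view.
SELECT name, engine
FROM system.tables
WHERE database = currentDatabase()
  AND name LIKE '%download_hour_mv%';
```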
The full text is over.
Enjoy ClickHouse :)