ClickHouse and Friends (12): The Magical Materialized View and How It Works

Original source: https://bohutang.me/2020/08/31/clickhouse-and-friends-materialized-view/

Last Update: 2020-08-31

In ClickHouse, the materialized view (Materialized View) is a magical and powerful feature with a purpose all its own.

This article analyzes the underlying mechanism and looks at how ClickHouse's materialized view works, so that we can use it better.

What is a materialized view

For most people, the concept of a materialized view is rather abstract. Materialized? View? . . .

To understand it better, let's start with a scenario.

Suppose you are a "happy" little programmer at *hub, and one day the product manager comes to you with a requirement: real-time statistics of video downloads per hour.

The user download table:

clickhouse> SELECT * FROM download LIMIT 10;
+---------------------+--------+--------+
| when                | userid | bytes  |
+---------------------+--------+--------+
| 2020-08-31 18:22:06 |     19 | 530314 |
| 2020-08-31 18:22:06 |     19 | 872957 |
| 2020-08-31 18:22:06 |     19 | 107047 |
| 2020-08-31 18:22:07 |     19 | 214876 |
| 2020-08-31 18:22:07 |     19 | 820943 |
| 2020-08-31 18:22:07 |     19 | 693959 |
| 2020-08-31 18:22:08 |     19 | 882151 |
| 2020-08-31 18:22:08 |     19 | 644223 |
| 2020-08-31 18:22:08 |     19 | 199800 |
| 2020-08-31 18:22:09 |     19 | 511439 |

... ....
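
For reference, the download table in these examples could be created with something like the following (a sketch; the original post does not show the DDL, so the column types, engine, and sort key here are assumptions):

CREATE TABLE download
(
    when   DateTime,
    userid UInt32,
    bytes  UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(when)
ORDER BY (userid, when);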

Calculate downloads per hour:

clickhouse> SELECT toStartOfHour(when) AS hour, userid, count() as downloads, sum(bytes) AS bytes FROM download GROUP BY userid, hour ORDER BY userid, hour;
+---------------------+--------+-----------+------------+
| hour                | userid | downloads | bytes      |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 |     19 |      6822 | 3378623036 |
| 2020-08-31 19:00:00 |     19 |     10800 | 5424173178 |
| 2020-08-31 20:00:00 |     19 |     10800 | 5418656068 |
| 2020-08-31 21:00:00 |     19 |     10800 | 5404309443 |
| 2020-08-31 22:00:00 |     19 |     10800 | 5354077456 |
| 2020-08-31 23:00:00 |     19 |     10800 | 5390852563 |
| 2020-09-01 00:00:00 |     19 |     10800 | 5369839540 |
| 2020-09-01 01:00:00 |     19 |     10800 | 5384161012 |
| 2020-09-01 02:00:00 |     19 |     10800 | 5404581759 |
| 2020-09-01 03:00:00 |     19 |      6778 | 3399557322 |
+---------------------+--------+-----------+------------+
10 rows in set (0.13 sec)

It's easy, but there is a problem: every query has to be computed from the full download table, and the amount of data at *hub is far too large for that to be bearable.

One way out: pre-aggregate the download data, save the result into a new table download_hour_mv, and keep it updated in real time as increments arrive in download; then each query only needs to read download_hour_mv.

This new table can be regarded as a materialized view, which is a normal table in ClickHouse.

Create a materialized view

clickhouse> CREATE MATERIALIZED VIEW download_hour_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour) ORDER BY (userid, hour)
AS SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM download WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour

This statement does three things:

  • Creates a materialized view download_hour_mv using the SummingMergeTree engine

  • Feeds the view from the download table, "materializing" the data according to the expressions in the SELECT statement

  • Picks a future time (the current time is 2020-08-31 18:00:00) as the starting point with WHERE when >= toDateTime('2020-09-01 04:00:00'), meaning only data from 2020-09-01 04:00:00 onward will be synchronized into download_hour_mv

This way, download_hour_mv is currently an empty table:

clickhouse> SELECT * FROM download_hour_mv ORDER BY userid, hour;
Empty set (0.02 sec)

Note: the POPULATE (https://clickhouse.tech/docs/en/sql-reference/statements/create/view/#materialized) keyword is officially available, but it is not recommended, because any data written to download while the view is being created would be lost. This is why we add a WHERE clause as a data synchronization point instead.
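
For comparison only, the POPULATE form would look roughly like this (a sketch of the syntax; as noted above, it is not the approach taken here):

CREATE MATERIALIZED VIEW download_hour_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour) ORDER BY (userid, hour)
POPULATE
AS SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM download
GROUP BY userid, hour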

So, how do we consistently synchronize the source table's historical data into download_hour_mv?

Materializing the full data

For the data before 2020-09-01 04:00:00, we can take a WHERE-bounded snapshot of download's historical data and materialize it with INSERT INTO ... SELECT:

clickhouse> INSERT INTO download_hour_mv
SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM download WHERE when < toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour

Query the materialized view:

clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads DESC;
+---------------------+--------+-----------+------------+
| hour                | userid | downloads | bytes      |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 |     19 |      6822 | 3378623036 |
| 2020-08-31 19:00:00 |     19 |     10800 | 5424173178 |
| 2020-08-31 20:00:00 |     19 |     10800 | 5418656068 |
| 2020-08-31 21:00:00 |     19 |     10800 | 5404309443 |
| 2020-08-31 22:00:00 |     19 |     10800 | 5354077456 |
| 2020-08-31 23:00:00 |     19 |     10800 | 5390852563 |
| 2020-09-01 00:00:00 |     19 |     10800 | 5369839540 |
| 2020-09-01 01:00:00 |     19 |     10800 | 5384161012 |
| 2020-09-01 02:00:00 |     19 |     10800 | 5404581759 |
| 2020-09-01 03:00:00 |     19 |      6778 | 3399557322 |
+---------------------+--------+-----------+------------+
10 rows in set (0.05 sec)

You can see that the historical data has been "materialized" into download_hour_mv.

Materializing the incremental data

Write some data to the download table:

clickhouse> INSERT INTO download
       SELECT
         toDateTime('2020-09-01 04:00:00') + number*(1/3) as when,
         19,
         rand() % 1000000
       FROM system.numbers
       LIMIT 10;

Query the materialized view download_hour_mv:

clickhouse> SELECT * FROM download_hour_mv ORDER BY hour, userid, downloads;
+---------------------+--------+-----------+------------+
| hour                | userid | downloads | bytes      |
+---------------------+--------+-----------+------------+
| 2020-08-31 18:00:00 |     19 |      6822 | 3378623036 |
| 2020-08-31 19:00:00 |     19 |     10800 | 5424173178 |
| 2020-08-31 20:00:00 |     19 |     10800 | 5418656068 |
| 2020-08-31 21:00:00 |     19 |     10800 | 5404309443 |
| 2020-08-31 22:00:00 |     19 |     10800 | 5354077456 |
| 2020-08-31 23:00:00 |     19 |     10800 | 5390852563 |
| 2020-09-01 00:00:00 |     19 |     10800 | 5369839540 |
| 2020-09-01 01:00:00 |     19 |     10800 | 5384161012 |
| 2020-09-01 02:00:00 |     19 |     10800 | 5404581759 |
| 2020-09-01 03:00:00 |     19 |      6778 | 3399557322 |
| 2020-09-01 04:00:00 |     19 |        10 |    5732600 |
+---------------------+--------+-----------+------------+
11 rows in set (0.00 sec)

As you can see, the last row is the materialized aggregation of the newly inserted increment, synchronized in real time. How is this done?

Materialized View Principle

The principle behind ClickHouse's materialized views is not complicated: when new data is written to the download table, if an associated materialized view is detected, the materialize operation is performed on the newly written data.

For example, the new data above is generated by the following SQL:

clickhouse> SELECT
    ->          toDateTime('2020-09-01 04:00:00') + number*(1/3) as when,
    ->          19,
    ->          rand() % 1000000
    ->        FROM system.numbers
    ->        LIMIT 10;
+---------------------+------+-------------------------+
| when                | 19   | modulo(rand(), 1000000) |
+---------------------+------+-------------------------+
| 2020-09-01 04:00:00 |   19 |                  870495 |
| 2020-09-01 04:00:00 |   19 |                  322270 |
| 2020-09-01 04:00:00 |   19 |                  983422 |
| 2020-09-01 04:00:01 |   19 |                  759708 |
| 2020-09-01 04:00:01 |   19 |                  975636 |
| 2020-09-01 04:00:01 |   19 |                  365507 |
| 2020-09-01 04:00:02 |   19 |                  865569 |
| 2020-09-01 04:00:02 |   19 |                  975742 |
| 2020-09-01 04:00:02 |   19 |                   85827 |
| 2020-09-01 04:00:03 |   19 |                  992779 |
+---------------------+------+-------------------------+
10 rows in set (0.02 sec)

The statement executed by the materialized view is roughly equivalent to:

INSERT INTO download_hour_mv
SELECT
  toStartOfHour(when) AS hour,
  userid,
  count() as downloads,
  sum(bytes) AS bytes
FROM [the 10 newly inserted rows] WHERE when >= toDateTime('2020-09-01 04:00:00')
GROUP BY userid, hour
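
One thing worth keeping in mind here: the view's SELECT only sees the block of rows being inserted, so repeated inserts into the same hour leave several rows with the same (userid, hour) key until SummingMergeTree collapses them during a background merge. To be safe, queries against the view can simply re-aggregate at read time, for example:

SELECT
  hour,
  userid,
  sum(downloads) AS downloads,
  sum(bytes) AS bytes
FROM download_hour_mv
GROUP BY hour, userid
ORDER BY hour, userid;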

Code navigation:

  1. Attach the view output stream on insert: InterpreterInsertQuery.cpp (https://github.com/ClickHouse/ClickHouse/blob/cb4644ea6d04b3d5900868b4f8d686a03082379a/src/Interpreters/InterpreterInsertQuery.cpp#L313)

                // If the table opts out of pushing to views (and this is a normal write),
                // write to it directly; otherwise wrap the output in
                // PushingToViewsBlockOutputStream so attached views receive the block too.
                if (table->noPushingToViews() && !no_destination)
                    out = table->write(query_ptr, metadata_snapshot, context);
                else
                    out = std::make_shared<PushingToViewsBlockOutputStream>(table, metadata_snapshot, context, query_ptr, no_destination);
    
  2. Construct the insert into the view: PushingToViewsBlockOutputStream.cpp (https://github.com/ClickHouse/ClickHouse/blob/cb4644ea6d04b3d5900868b4f8d686a03082379a/src/DataStreams/PushingToViewsBlockOutputStream.cpp#L85)

                // Build an INSERT into the view's target table and execute it,
                // reusing InterpreterInsertQuery to obtain the output stream.
                ASTPtr insert_query_ptr(insert.release());
                InterpreterInsertQuery interpreter(insert_query_ptr, *insert_context);
                BlockIO io = interpreter.execute();
                out = io.out;
    
  3. Materialize the new data: PushingToViewsBlockOutputStream.cpp (https://github.com/ClickHouse/ClickHouse/blob/cb4644ea6d04b3d5900868b4f8d686a03082379a/src/DataStreams/PushingToViewsBlockOutputStream.cpp#L331)

            // Expose the just-written block as the view's source (StorageValues),
            // run the view's SELECT over it, and materialize the resulting stream.
            Context local_context = *select_context;
            local_context.addViewSource(
                StorageValues::create(
                    storage->getStorageID(), metadata_snapshot->getColumns(), block, storage->getVirtuals()));
            select.emplace(view.query, local_context, SelectQueryOptions());
            in = std::make_shared<MaterializingBlockInputStream>(select->execute().getInputStream());

Summary

There are many uses for materialized views.

For example, to work around the limits of a table's sort key, we can use a materialized view to keep the same data in another physical order, so that queries with different filter conditions can still be served efficiently.
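
As a rough sketch of that idea (the view name is made up for illustration, and the base table is assumed to be ordered by (userid, when) as in the sketch earlier), a query filtering only on when can be served by a view that keeps the same rows in a different physical order:

CREATE MATERIALIZED VIEW download_by_time_mv
ENGINE = MergeTree
PARTITION BY toYYYYMM(when)
ORDER BY (when, userid)
AS SELECT when, userid, bytes
FROM download;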

Also, thanks to the real-time data synchronization of materialized views, we can change table structures more flexibly.

Even more powerful: combined with the MergeTree family of engines (SummingMergeTree, AggregatingMergeTree, etc.), it gives us real-time pre-aggregation to serve fast queries.
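
For aggregates that SummingMergeTree cannot keep as plain sums (avg, uniqExact, quantiles, and so on), the usual pattern is an AggregatingMergeTree view that stores partial aggregation states via -State combinators and finalizes them with -Merge at query time. A minimal sketch (the view and column names are made up for illustration):

CREATE MATERIALIZED VIEW download_daily_mv
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day) ORDER BY (userid, day)
AS SELECT
  toDate(when) AS day,
  userid,
  countState() AS downloads_state,
  avgState(bytes) AS avg_bytes_state
FROM download
GROUP BY userid, day;

-- At query time the partial states are finalized with -Merge:
SELECT
  day,
  userid,
  countMerge(downloads_state) AS downloads,
  avgMerge(avg_bytes_state) AS avg_bytes
FROM download_daily_mv
GROUP BY day, userid
ORDER BY day, userid;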

The principle is that incremental data is processed by the AS SELECT ... expression and written into the materialized view's table; the materialized view itself is an ordinary table that can be read and written directly.

That's all.

Enjoy ClickHouse :)
