ClickHouse Advanced - Materialized Views with Multi-Table Joins

Introduction

At the time of writing, Doris 1.2 supports materialized views only on a single table. This article looks at how ClickHouse handles materialized views that join multiple tables.

Foreword

This article is translated from Altinity's series of technical articles on ClickHouse. ClickHouse is an open-source engine for online analytical processing (OLAP), widely adopted by companies in China and abroad for its excellent query performance, petabyte-scale data capacity, and simple architecture.

The Alibaba Cloud EMR-OLAP team has made a series of optimizations on top of open-source ClickHouse and offers it as a managed cloud service. EMR ClickHouse is fully compatible with the features of the open-source version. It adds cloud product functions such as rapid cluster deployment, cluster management, scale-out, scale-in, and monitoring and alerting, and it improves ClickHouse read and write performance as well as its ability to integrate quickly with other EMR components. Visit https://help.aliyun.com/document_detail/212195.html for details.

Using Joins in ClickHouse Materialized Views

ClickHouse materialized views provide a powerful way to reorganize data in ClickHouse. We've discussed their capabilities many times in webinars, blog posts, and conference talks. One of the most common follow-up questions we receive is: do materialized views support joins?

The answer is yes. This blog post shows how. If you want the short answer, here it is: the materialized view is triggered by inserts into the leftmost table in the join. The materialized view will pull values from the right-side tables in the join, but it will not fire when those tables change.

Read on for a detailed example of materialized views and Join behavior. We'll also explain the underlying mechanics to help you better understand ClickHouse behavior when creating your own views. Note: The examples are from ClickHouse version 20.3.

Table Definition

Materialized views can transform data in all kinds of interesting ways, but let's stick to a simple case. We'll use a download table as an example to demonstrate how to build a Total Daily Downloads metric that pulls information from several dimension tables. A summary of the schema follows.

We first define the download table. This table can grow very large.

CREATE TABLE download (
  when DateTime,
  userid UInt32,
  bytes UInt64
) ENGINE=MergeTree
PARTITION BY toYYYYMM(when)
ORDER BY (userid, when)

Next, we define a dimension table that maps user IDs to prices per GB of download. This table is relatively small.

CREATE TABLE price (
  userid UInt32,
  price_per_gb Float64
) ENGINE=MergeTree
PARTITION BY tuple()
ORDER BY userid

Finally, we define a dimension table that maps user IDs to names. This table is also very small.

CREATE TABLE user (
  userid UInt32,
  name String
) ENGINE=MergeTree
PARTITION BY tuple()
ORDER BY userid

Materialized View Definition

Now, let's create a materialized view that summarizes daily downloads and bytes by user ID and calculates a price based on the number of bytes downloaded. We need to create the target table directly, and then use a materialized view definition with the TO keyword pointing to our table.

The target table is as follows.

CREATE TABLE download_daily (
  day Date,
  userid UInt32,
  downloads UInt32,
  total_gb Float64,
  total_price Float64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day) ORDER BY (userid, day)

The above definition makes use of the specialized SummingMergeTree behavior. Any non-key numeric field counts as an aggregate, so we don't have to use aggregate functions in the column definition.
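
To make the summing behavior concrete, here is a minimal Python sketch of what a merge does to rows that share the same sorting key. It is only a model, not ClickHouse code: the function name summing_merge and the sample rows are hypothetical, and real merges run in the background at times ClickHouse chooses, so queries on the target table should still aggregate with sum() and GROUP BY.

```python
def summing_merge(rows, key_columns):
    """Sum every non-key numeric column across rows that share the same key."""
    merged = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in merged:
            merged[key] = dict(row)  # first row seen for this key: copy it as-is
            continue
        target = merged[key]
        for column, value in row.items():
            if column not in key_columns:
                target[column] += value  # non-key numeric columns are summed
    return [merged[key] for key in sorted(merged)]


# Three hypothetical rows written by separate inserts; two share (userid, day).
parts = [
    {"userid": 25, "day": "2020-07-14", "downloads": 2, "total_gb": 0.5, "total_price": 0.05},
    {"userid": 25, "day": "2020-07-14", "downloads": 1, "total_gb": 0.25, "total_price": 0.025},
    {"userid": 26, "day": "2020-07-14", "downloads": 4, "total_gb": 1.0, "total_price": 0.05},
]
merged_rows = summing_merge(parts, key_columns=("userid", "day"))
for row in merged_rows:
    print(row)
```

The two rows for userid 25 collapse into one whose downloads, total_gb, and total_price are the sums of the originals, while the key columns are left untouched.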

Finally, here is our materialized view definition. It is also possible to define it in a more compact way, but as you will see shortly, this form makes it easier to extend the view to join with more tables.

CREATE MATERIALIZED VIEW download_daily_mv
TO download_daily AS
SELECT
  day AS day, userid AS userid, count() AS downloads,
  sum(gb) as total_gb, sum(price) as total_price
FROM (
  SELECT
    toDate(when) AS day,
    userid AS userid,
    download.bytes / (1024*1024*1024) AS gb,
    gb * price.price_per_gb AS price
  FROM download LEFT JOIN price ON download.userid = price.userid
)
GROUP BY userid, day
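
The arithmetic inside the view can be traced with a short Python sketch. This is a model under stated assumptions rather than the view itself: transform_block and the sample rows are hypothetical, and a missing price is modeled as 0.0, which is what a LEFT JOIN yields for a Float64 column under the default join_use_nulls = 0 setting.

```python
from collections import defaultdict
from datetime import datetime

BYTES_PER_GB = 1024 * 1024 * 1024


def transform_block(download_rows, price_by_user):
    """Model the view's SELECT for one inserted block of (when, userid, bytes)."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])  # downloads, total_gb, total_price
    for when, userid, nbytes in download_rows:
        gb = nbytes / BYTES_PER_GB
        # LEFT JOIN: an unmatched userid gets the column default (0.0), not NULL.
        price = gb * price_by_user.get(userid, 0.0)
        entry = totals[(userid, when.date().isoformat())]
        entry[0] += 1
        entry[1] += gb
        entry[2] += price
    return {key: tuple(values) for key, values in totals.items()}


prices = {25: 0.10, 26: 0.05, 27: 0.01}
block = [
    (datetime(2020, 7, 14, 10, 0), 25, 2 * BYTES_PER_GB),
    (datetime(2020, 7, 14, 11, 0), 25, 1 * BYTES_PER_GB),
    (datetime(2020, 7, 15, 9, 0), 26, 4 * BYTES_PER_GB),
    (datetime(2020, 7, 15, 9, 30), 30, 1 * BYTES_PER_GB),  # no price row for 30
]
daily = transform_block(block, prices)
for key in sorted(daily):
    print(key, daily[key])
```

Each output key is a (userid, day) pair, matching the GROUP BY, and the row for userid 30 is kept with a total_price of 0.0 because the join is a LEFT JOIN.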

Loading Data

We can now test the view by loading the data. We start by loading two dimension tables with username and price information.

INSERT INTO price VALUES (25, 0.10), (26, 0.05), (27, 0.01);
INSERT INTO user VALUES (25, 'Bob'), (26, 'Sue'), (27, 'Sam');

Next, we add sample data to the download fact table. The following INSERT adds 5000 rows, evenly distributed across the userid values listed in the user table.

INSERT INTO download
  WITH
    (SELECT groupArray(userid) FROM user) AS user_ids
  SELECT
    now() + number * 60 AS when,
    user_ids[(number % length(user_ids)) + 1] AS user_id,
    rand() % 100000000 AS bytes
  FROM system.numbers
  LIMIT 5000
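
The way this INSERT spreads rows across users can be checked with a few lines of Python. The sketch assumes groupArray returns the ids in the order 25, 26, 27 (ClickHouse does not guarantee that order), and pick_user is a hypothetical helper. Note that ClickHouse arrays are 1-based, which is why the SQL adds 1 to the modulo result; Python lists are 0-based.

```python
from collections import Counter

user_ids = [25, 26, 27]  # assumed result of groupArray(userid)


def pick_user(number, ids):
    """Round-robin choice, equivalent to user_ids[(number % length) + 1] in SQL."""
    return ids[number % len(ids)]


counts = Counter(pick_user(n, user_ids) for n in range(5000))

# One row per minute (now() + number * 60 seconds), so a full day holds 1440 rows.
rows_per_full_day_per_user = (24 * 60) // len(user_ids)

print(dict(counts))
print(rows_per_full_day_per_user)
```

With one row per minute, each user receives 480 rows on every full day, and the 5000 rows span roughly three and a half days, so the first and last days are partial.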

At this point we can see that the materialized view fills the download_daily with data. Below is an example query.

SELECT day, downloads, total_gb, total_price
FROM download_daily WHERE userid = 25

┌────────day─┬─downloads─┬───────────total_gb─┬────────total_price─┐
│ 2020-07-14 │       108 │  5.054316438734531 │ 0.5054316438734532 │
│ 2020-07-15 │       480 │  22.81532768998295 │  2.281532768998296 │
│ 2020-07-16 │       480 │  21.07045224122703 │  2.107045224122702 │
│ 2020-07-17 │       480 │ 21.606687822379172 │ 2.1606687822379183 │
│ 2020-07-18 │       119 │  5.548438269644976 │ 0.5548438269644972 │
└────────────┴───────────┴────────────────────┴────────────────────┘

So far so good. But we can go further. Let's first look at the principles behind ClickHouse.

How It Works Under the Hood

To use materialized views effectively, it helps to understand the mechanics behind them. A materialized view operates on a single table as a post-insert trigger: it runs its SELECT over each newly inserted block of that table. If the query in the materialized view definition includes a join, the source table is the leftmost table in the join.

In our example, download is the left table. Therefore, any insert into download results in a part being written to download_daily. An insert into a right-side table has no effect by itself, although that table's values are read into the join whenever an insert into download fires the view.
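
The trigger semantics can be modeled in a few lines of Python. This is a deliberately simplified sketch, not how ClickHouse is implemented: the Table class, its triggers list, and the tuple layouts are hypothetical, and the aggregation step is omitted so that only the firing behavior is visible.

```python
class Table:
    """A toy table that calls its triggers with each newly inserted block."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self.triggers = []

    def insert(self, block):
        self.rows.extend(block)
        for trigger in self.triggers:
            trigger(block)


download = Table("download")
price = Table("price")
download_daily = Table("download_daily")


def download_daily_mv(block):
    # The view sees only the inserted block of the left table, but it reads
    # the full current contents of the right-side table at that moment.
    price_by_user = dict(price.rows)
    out = [(day, userid, gb, gb * price_by_user.get(userid, 0.0))
           for day, userid, gb in block]
    download_daily.insert(out)


# The view is attached to the leftmost table only.
download.triggers.append(download_daily_mv)

price.insert([(25, 0.10)])
after_price_insert = len(download_daily.rows)

download.insert([("2020-07-14", 25, 2.0)])
after_download_insert = len(download_daily.rows)

price.insert([(25, 0.20)])  # changing the dimension table fires nothing
after_second_price_insert = len(download_daily.rows)

print(after_price_insert, after_download_insert, after_second_price_insert)
```

Only the insert into download adds a row to download_daily. The later price change does not rewrite the row that was already produced; it would only affect rows generated by subsequent inserts into download.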

It's easy to demonstrate this behavior if we create a more interesting materialized view. Let's define a view that does a right outer join on the user table. In this case, we'll use a plain MergeTree table so we can see all generated rows without the row merging that SummingMergeTree performs. Below is a simple target table followed by a materialized view that will populate the target table from the download table.

CREATE TABLE download_right_outer (
  when DateTime,
  userid UInt32,
  name String,
  bytes UInt64
) ENGINE=MergeTree
PARTITION BY toYYYYMM(when)
ORDER BY (when, userid)

CREATE MATERIALIZED VIEW download_right_outer_mv
TO download_right_outer
AS SELECT
  when AS when,
  user.userid AS userid,
  user.name AS name,
  bytes AS bytes
FROM download RIGHT OUTER JOIN user ON (download.userid = user.userid)

What happens when we insert a row into the download table? The materialized view generates one row for the inserted row *and* one row for every row in the user table that has no match, because we're doing a right outer join. (As you may have noticed, this view also has a potential flaw. We'll deal with that shortly.)

INSERT INTO download VALUES (now(), 26, 555)

SELECT * FROM download_right_outer

┌────────────────when─┬─userid─┬─name─┬─bytes─┐
│ 2020-07-12 17:27:35 │     26 │ Sue  │   555 │
└─────────────────────┴────────┴──────┴───────┘
┌────────────────when─┬─userid─┬─name─┬─bytes─┐
│ 0000-00-00 00:00:00 │     25 │ Bob  │     0 │
│ 0000-00-00 00:00:00 │     27 │ Sam  │     0 │
└─────────────────────┴────────┴──────┴───────┘

On the other hand, if you insert a row into the user table, nothing changes in the materialized view.

INSERT INTO user VALUES (28, 'Kate')

SELECT * FROM download_right_outer

┌────────────────when─┬─userid─┬─name─┬─bytes─┐
│ 2020-07-12 17:27:35 │     26 │ Sue  │   555 │
└─────────────────────┴────────┴──────┴───────┘
┌────────────────when─┬─userid─┬─name─┬─bytes─┐
│ 0000-00-00 00:00:00 │     25 │ Bob  │     0 │
│ 0000-00-00 00:00:00 │     27 │ Sam  │     0 │
└─────────────────────┴────────┴──────┴───────┘

Only when you add more rows to the download table will you see the effect of the new user row.
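
The rows produced by the right outer join can be reproduced with a small Python model. It is a sketch only: right_outer_join is a hypothetical helper, the second timestamp is made up for illustration, and the default DateTime value is written the way version 20.3 displays it in the output above.

```python
DEFAULT_DATETIME = "0000-00-00 00:00:00"  # as displayed by ClickHouse 20.3


def right_outer_join(left_block, right_rows):
    """Join an inserted block of (when, userid, bytes) against (userid, name) rows."""
    out = []
    matched = set()
    for r_userid, name in right_rows:
        for when, l_userid, nbytes in left_block:
            if l_userid == r_userid:
                out.append((when, r_userid, name, nbytes))
                matched.add(r_userid)
    # Right-side rows without a match still appear, padded with column defaults.
    for r_userid, name in right_rows:
        if r_userid not in matched:
            out.append((DEFAULT_DATETIME, r_userid, name, 0))
    return out


users = [(25, "Bob"), (26, "Sue"), (27, "Sam")]

# First insert into download: one matching row for Sue, defaults for the rest.
first = right_outer_join([("2020-07-12 17:27:35", 26, 555)], users)

# Adding Kate to user produces nothing until download receives another block.
users.append((28, "Kate"))
second = right_outer_join([("2020-07-12 18:00:00", 25, 100)], users)

for row in first + second:
    print(row)
```

The first call yields the three rows shown in the query output above. Kate appears only in the second call, as an unmatched row generated by the next insert into download.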

Joining Multiple Tables

Like a SELECT statement, a materialized view can join multiple tables. In the first example, we joined download with the price table, whose price varies by userid. Now let's also join the user table, which maps a userid to a username. In this example, we'll add a new target table that includes a username column. Since username is not an aggregate, we also add it to the ORDER BY. This prevents the SummingMergeTree engine from trying to aggregate it.

CREATE TABLE download_daily_with_name (
  day Date,
  userid UInt32,
  username String,
  downloads UInt32,
  total_gb Float64,
  total_price Float64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day) ORDER BY (userid, day, username)

Now let's define a materialized view, which extends the SELECT of the first example in a simple and straightforward manner.

CREATE MATERIALIZED VIEW download_daily_with_name_mv
TO download_daily_with_name AS
SELECT
  day AS day, userid AS userid, user.name AS username,
  count() AS downloads, sum(gb) as total_gb, sum(price) as total_price
FROM (
  SELECT
    toDate(when) AS day,
    userid AS userid,
    download.bytes / (1024*1024*1024) AS gb,
    gb * price.price_per_gb AS price
  FROM download LEFT JOIN price ON download.userid = price.userid
) AS join1
LEFT JOIN user ON join1.userid = user.userid
GROUP BY userid, day, username

You can test the new view by truncating the download table and reloading the data. This is left as an exercise for the reader.

Be Careful What You Wish For

The ClickHouse SELECT statement supports a wide range of Join types, which provides great flexibility in the transformations implemented by materialized views. Flexibility can be a double-edged sword, as it creates more opportunities for unintended outcomes.

For example, what happens if you insert a row with userid 30 in download? This userid does not exist in either the user table or the price table.

INSERT INTO download VALUES (now(), 30, 222)

In short: if you don't define the materialized view carefully, the row might not appear in the target table. To make sure the row is kept even when it has no match, use a LEFT OUTER JOIN or FULL OUTER JOIN. This makes sense, because it is the same behavior you get when running the SELECT by itself. The download_right_outer_mv example has exactly this problem.
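
The difference between the join types for an unmatched userid can be seen in a short Python model. It is a sketch under assumptions: join_block is a hypothetical helper, and the empty string stands for the default value of a String column that ClickHouse fills in when join_use_nulls is left at its default of 0.

```python
def join_block(block, users, kind):
    """Join an inserted block of (userid, bytes) with (userid, name) rows.

    kind is "inner" or "left"; only "left" keeps rows that have no match.
    """
    names = dict(users)
    out = []
    for userid, nbytes in block:
        if userid in names:
            out.append((userid, names[userid], nbytes))
        elif kind == "left":
            out.append((userid, "", nbytes))  # String default for the missing name
    return out


users = [(25, "Bob"), (26, "Sue"), (27, "Sam")]
block = [(30, 222)]  # userid 30 exists in neither dimension table

inner_rows = join_block(block, users, kind="inner")
left_rows = join_block(block, users, kind="left")

print(inner_rows)
print(left_rows)
```

With the inner join the inserted row disappears from the view's output, and a right outer join loses it in the same way. The left join keeps the row, with an empty name.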

View definitions can also produce subtle syntax errors. For example, omitting a column from the GROUP BY can cause puzzling failures. Below is a simple example.

CREATE MATERIALIZED VIEW download_daily_join_old_style_mv
ENGINE = SummingMergeTree PARTITION BY toYYYYMM(day)
ORDER BY (userid, day) POPULATE AS SELECT
    toDate(when) AS day,
    download.userid AS userid,
    user.username AS name,
    count() AS downloads,
    sum(bytes) AS bytes
FROM download INNER JOIN user ON download.userid = user.userid
GROUP BY userid, day  -- Column `username` is missing!

Received exception from server (version 20.3.8):
Code: 10. DB::Exception: Received from localhost:9000. DB::Exception: Not found column name in block. There are only columns: userid, toStartOfDay(when), count(), sum(bytes).

What went wrong? The username column is missing from the GROUP BY. It is reasonable for ClickHouse to reject the view definition, but the error message is a bit difficult to interpret.

Finally, it is important to specify columns carefully when they overlap between joined tables. Below is a slightly different version of the RIGHT OUTER JOIN example above.

CREATE MATERIALIZED VIEW download_right_outer_mv
TO download_right_outer
AS SELECT
  when AS when,
  userid,
  user.name AS name,
  bytes AS bytes
FROM download RIGHT OUTER JOIN user ON (download.userid = user.userid)

When you insert rows into download, you get a result like the one below, where userid is 0 on the non-matching rows.

SELECT * FROM download_right_outer

┌────────────────when─┬─userid─┬─name─┬─bytes─┐
│ 0000-00-00 00:00:00 │      0 │ Sue  │     0 │
│ 0000-00-00 00:00:00 │      0 │ Sam  │     0 │
└─────────────────────┴────────┴──────┴───────┘
┌────────────────when─┬─userid─┬─name─┬─bytes─┐
│ 2020-07-12 18:04:56 │     25 │ Bob  │   222 │
└─────────────────────┴────────┴──────┴───────┘

In this case, ClickHouse seems to fill in a default value instead of assigning the value from user.userid. You must name the source column unambiguously (user.userid) and assign its name with AS userid. You do not get this effect if you run the SELECT query alone. This behavior looks like a bug.

Conclusion

Materialized views are one of the most versatile features available to ClickHouse users. A materialized view is populated by a SELECT statement that can join multiple tables. The key thing to understand is that ClickHouse fires the view only on inserts into the leftmost table in the join. Other tables provide data for the transformation, but the view does not react to inserts into those tables.

Joins bring new flexibility, but can also lead to unexpected results. Therefore, it is best to test materialized views carefully, especially if there is a Join.

Reference

Using Joins in ClickHouse Materialized Views – Altinity | The Real Time Data Company

 


Origin blog.csdn.net/S1124654/article/details/129294600