This article teaches you how to use ClickHouse's powerful materialized views


Number of words in this article: 11558; estimated reading time: 29 minutes

Author: Denys Golotiuk

Reviewer: Zhuang Xiaodong (Weizhuang)


Introduction

In the real world, data not only needs to be stored but also processed. Processing is usually done on the application side, but some key processing can be moved into ClickHouse to improve performance and make the data easier to manage. One of the most powerful tools ClickHouse offers for this is materialized views. In this article, we'll explore materialized views and how they accomplish tasks such as accelerating queries as well as data transformation, filtering, and routing.

If you would like to learn more about materialized views, we are offering a free training course later.

What is a materialized view?

A materialized view is a special kind of trigger: when data is inserted, it executes a SELECT query over that data and stores the result in a target table.


This is useful in many scenarios; let's look at the most popular one: making certain queries faster.

Quick example

Take Wikistat’s 1 billion row data set as an example:

CREATE TABLE wikistat
(
    `time` DateTime CODEC(Delta(4), ZSTD(1)),
    `project` LowCardinality(String),
    `subproject` LowCardinality(String),
    `path` String,
    `hits` UInt64
)
ENGINE = MergeTree
ORDER BY (path, time);

Ok.

INSERT INTO wikistat SELECT *
FROM s3('https://ClickHouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst') LIMIT 1e9

Suppose we frequently query the most popular items on a certain date:

SELECT
    project,
    sum(hits) AS h
FROM wikistat
WHERE date(time) = '2015-05-01'
GROUP BY project
ORDER BY h DESC
LIMIT 10

This query takes 15 seconds to complete on the test instance:

┌─project─┬────────h─┐
│ en      │ 34521803 │
│ es      │  4491590 │
│ de      │  4490097 │
│ fr      │  3390573 │
│ it      │  2015989 │
│ ja      │  1379148 │
│ pt      │  1259443 │
│ tr      │  1254182 │
│ zh      │   988780 │
│ pl      │   985607 │
└─────────┴──────────┘

10 rows in set. Elapsed: 14.869 sec. Processed 972.80 million rows, 10.53 GB (65.43 million rows/s., 708.05 MB/s.)

If we have a large number of queries like this and we need millisecond performance from ClickHouse, we can create a materialized view for this query:

CREATE TABLE wikistat_top_projects
(
    `date` Date,
    `project` LowCardinality(String),
    `hits` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (date, project);

Ok.

CREATE MATERIALIZED VIEW wikistat_top_projects_mv TO wikistat_top_projects AS
SELECT
    date(time) AS date,
    project,
    sum(hits) AS hits
FROM wikistat
GROUP BY
    date,
    project;

In these two queries:

  • wikistat_top_projects  is the name of the table we want to use to save the materialized view,

  • wikistat_top_projects_mv  is the name of the materialized view itself (trigger),

  • We used the SummingMergeTree table engine because we want to sum the hits values for each date/project pair,

  • The content following AS  is the query to construct the materialized view.

We can create any number of materialized views, but each new materialized view adds storage overhead, so keep the total number reasonable; as a rule of thumb, keep the number of materialized views per source table under 10.
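To keep an eye on how many materialized views exist, we can query the system tables. This is a sketch; adjust the database filter to your setup:

```sql
-- List all materialized views in the current database
SELECT name
FROM system.tables
WHERE engine = 'MaterializedView'
  AND database = currentDatabase()
```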

Now let's populate the materialized view's target table with the existing data from the  wikistat  table, using the same query as the view itself:

INSERT INTO wikistat_top_projects SELECT
    date(time) AS date,
    project,
    sum(hits) AS hits
FROM wikistat
GROUP BY
    date,
    project

Query materialized view table

Since wikistat_top_projects  is a table, we can query it with the full power of ClickHouse SQL:

SELECT
    project,
    sum(hits) hits
FROM wikistat_top_projects
WHERE date = '2015-05-01'
GROUP BY project
ORDER BY hits DESC
LIMIT 10

┌─project─┬─────hits─┐
│ en      │ 34521803 │
│ es      │  4491590 │
│ de      │  4490097 │
│ fr      │  3390573 │
│ it      │  2015989 │
│ ja      │  1379148 │
│ pt      │  1259443 │
│ tr      │  1254182 │
│ zh      │   988780 │
│ pl      │   985607 │
└─────────┴──────────┘

10 rows in set. Elapsed: 0.003 sec. Processed 8.19 thousand rows, 101.81 KB (2.83 million rows/s., 35.20 MB/s.)

Note that it only took ClickHouse 3ms to produce the same result, whereas the original query took 15 seconds. Also note that because merges in the SummingMergeTree engine happen asynchronously in the background (this saves resources and reduces the impact on query processing), some rows may not have been merged yet, so we still need to use GROUP BY here.

Manage materialized views

We can use the SHOW TABLES query to list materialized views:

SHOW TABLES LIKE 'wikistat_top_projects_mv'

┌─name─────────────────────┐
│ wikistat_top_projects_mv │
└──────────────────────────┘

We can delete a materialized view using DROP TABLE , but this will only delete the trigger itself:

DROP TABLE wikistat_top_projects_mv

If the target table is no longer needed, remember to delete it as well:

DROP TABLE wikistat_top_projects

Get the size of the materialized view on disk

All metadata about the materialized view table is stored in the system database, like other tables. For example, to get its size on disk, we can do the following:

SELECT
    rows,
    formatReadableSize(total_bytes) AS total_bytes_on_disk
FROM system.tables
WHERE table = 'wikistat_top_projects'

┌──rows─┬─total_bytes_on_disk─┐
│ 15336 │ 37.42 KiB           │
└───────┴─────────────────────┘

Update data in materialized view

The most powerful feature of materialized views is that when data is inserted into the source table, the target table is automatically updated with the result of the  SELECT  statement.


Therefore, we do not need to additionally refresh the data in the materialized view - ClickHouse does everything automatically. Suppose we insert new data into the  wikistat  table:

INSERT INTO wikistat
VALUES(now(), 'test', '', '', 10),
      (now(), 'test', '', '', 10),
      (now(), 'test', '', '', 20),
      (now(), 'test', '', '', 30);

Now, let us query the target table of the materialized view to verify that the  hits  column has been summarized correctly. We use the FINAL modifier to ensure that the SummingMergeTree engine returns summarized hits rather than individual, unmerged rows:

SELECT hits
FROM wikistat_top_projects
FINAL
WHERE (project = 'test') AND (date = date(now()))

┌─hits─┐
│   70 │
└──────┘

1 row in set. Elapsed: 0.005 sec. Processed 7.15 thousand rows, 89.37 KB (1.37 million rows/s., 17.13 MB/s.)

In a production environment, avoid using  FINAL  on large tables, and prefer an explicit  sum(hits)  instead. Also check the optimize_on_insert setting, which controls how inserted data is merged.
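For example, instead of FINAL, the same total can be obtained with an explicit aggregation (a sketch mirroring the query above):

```sql
-- Aggregate at query time instead of relying on FINAL
SELECT sum(hits) AS hits
FROM wikistat_top_projects
WHERE (project = 'test') AND (date = date(now()))
```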

Accelerate aggregation using materialized views

As shown in the previous section, materialized views are a way to improve query performance. For analytical queries, common aggregation operations are not just sum() as shown in the previous example. SummingMergeTree is great for calculating summary data, but there are more advanced aggregations that can be calculated using the AggregatingMergeTree engine.

Suppose we frequently execute the following types of queries:

SELECT
    toDate(time) AS date,
    min(hits) AS min_hits_per_hour,
    max(hits) AS max_hits_per_hour,
    avg(hits) AS avg_hits_per_hour
FROM wikistat
WHERE project = 'en'
GROUP BY date

This gives us the daily minimum, maximum, and average of hourly hits for a given project:

┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬──avg_hits_per_hour─┐
│ 2015-05-01 │                 1 │             36802 │  4.586310181621408 │
│ 2015-05-02 │                 1 │             23331 │  4.241388590780171 │
│ 2015-05-03 │                 1 │             24678 │  4.317835245126423 │
...
└────────────┴───────────────────┴───────────────────┴────────────────────┘

38 rows in set. Elapsed: 8.970 sec. Processed 994.11 million rows

Note that our raw data has been aggregated by hour.

We use materialized views to store these aggregated results for faster retrieval, defining the aggregates with state combinators. A state combinator tells ClickHouse to save the internal aggregation state rather than the final aggregation result, which lets us compute aggregates without keeping every original row. The approach is simple: we use the *State() functions when creating the materialized view, and their corresponding *Merge() functions at query time to obtain the final aggregate results.


In our example, we will use the  min ,  max , and  avg  states. In the target table of the new materialized view, we will use the  AggregateFunction  type to store aggregate states instead of values:

CREATE TABLE wikistat_daily_summary
(
    `project` String,
    `date` Date,
    `min_hits_per_hour` AggregateFunction(min, UInt64),
    `max_hits_per_hour` AggregateFunction(max, UInt64),
    `avg_hits_per_hour` AggregateFunction(avg, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (project, date);

Ok.

CREATE MATERIALIZED VIEW wikistat_daily_summary_mv
TO wikistat_daily_summary AS
SELECT
    project,
    toDate(time) AS date,
    minState(hits) AS min_hits_per_hour,
    maxState(hits) AS max_hits_per_hour,
    avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY project, date

Now, let's fill it with data:

INSERT INTO wikistat_daily_summary SELECT
    project,
    toDate(time) AS date,
    minState(hits) AS min_hits_per_hour,
    maxState(hits) AS max_hits_per_hour,
    avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY project, date

0 rows in set. Elapsed: 33.685 sec. Processed 994.11 million rows

At query time, we use the corresponding  Merge combinator to retrieve the value:

SELECT
    date,
    minMerge(min_hits_per_hour) min_hits_per_hour,
    maxMerge(max_hits_per_hour) max_hits_per_hour,
    avgMerge(avg_hits_per_hour) avg_hits_per_hour
FROM wikistat_daily_summary
WHERE project = 'en'
GROUP BY date

Note that we get exactly the same result, but thousands of times faster:

┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬──avg_hits_per_hour─┐
│ 2015-05-01 │                 1 │             36802 │  4.586310181621408 │
│ 2015-05-02 │                 1 │             23331 │  4.241388590780171 │
│ 2015-05-03 │                 1 │             24678 │  4.317835245126423 │
...
└────────────┴───────────────────┴───────────────────┴────────────────────┘

32 rows in set. Elapsed: 0.005 sec. Processed 9.54 thousand rows, 1.14 MB (1.76 million rows/s., 209.01 MB/s.)

Any aggregate function can be used with the State/Merge combinator pair as part of an aggregating materialized view.
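For instance, the uniq function follows the same pattern. The sketch below (table and view names are illustrative) tracks the number of distinct paths per project and day:

```sql
CREATE TABLE wikistat_daily_uniq
(
    `project` String,
    `date` Date,
    `uniq_paths` AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (project, date);

CREATE MATERIALIZED VIEW wikistat_daily_uniq_mv TO wikistat_daily_uniq AS
SELECT
    project,
    toDate(time) AS date,
    uniqState(path) AS uniq_paths
FROM wikistat
GROUP BY project, date;

-- At query time, finish the aggregation with uniqMerge
SELECT
    date,
    uniqMerge(uniq_paths) AS unique_paths
FROM wikistat_daily_uniq
WHERE project = 'en'
GROUP BY date;
```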

Compacting data to optimize storage

In some cases, we only need to store aggregated data, yet the data arrives in an event-based manner. If we only need the raw data for the last few days and can keep older data in aggregated form, we can achieve this by combining a materialized view with a TTL on the source table.

In order to optimize storage space, we can also explicitly declare column types to ensure that the table structure is optimal. Suppose we want to store only monthly aggregated data for each path from the  wikistat  table:
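The target table for this view is not shown in the original text; it might look like the following sketch, assuming a SummingMergeTree so that hits keep being summed per path/month:

```sql
CREATE TABLE wikistat_monthly
(
    `month` Date,
    `path` String,
    `hits` UInt64
)
ENGINE = SummingMergeTree
ORDER BY (path, month);
```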

CREATE MATERIALIZED VIEW wikistat_monthly_mv TO
wikistat_monthly AS
SELECT
    toDate(toStartOfMonth(time)) AS month,
    path,
    sum(hits) AS hits
FROM wikistat
GROUP BY
    path,
    month

The original table (with data stored on an hourly basis) takes up about 3 times as much disk space as the aggregated materialized view:

wikistat (original table): 1.78 GiB (~1 billion rows)
wikistat_monthly (materialized view): 565.68 MiB (~27 million rows)
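Figures like these can be reproduced from the system.parts table (a sketch; only active parts are counted):

```sql
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    sum(rows) AS rows
FROM system.parts
WHERE active AND (table IN ('wikistat', 'wikistat_monthly'))
GROUP BY table
```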

One point to note here is that this kind of compaction only pays off when it reduces the number of resulting rows by at least a factor of 10 or so. In other cases, ClickHouse's powerful compression and encoding algorithms already make the raw table comparably storage-efficient without any aggregation.

Now that we have the monthly aggregation, we can add a TTL expression to the original table so that the data is deleted after 1 week:

ALTER TABLE wikistat MODIFY TTL time + INTERVAL 1 WEEK

Validate and filter data

Another popular example of using materialized views is to process data immediately after insertion. Data validation is a good example.


Suppose we want to filter out all paths containing unwanted symbols before saving the data to the result table. About 1% of the values in our table look like this:

SELECT count(*)
FROM wikistat
WHERE NOT match(path, '[a-z0-9\\-]')
LIMIT 5

┌──count()─┐
│ 12168918 │
└──────────┘

1 row in set. Elapsed: 46.324 sec. Processed 994.11 million rows, 28.01 GB (21.46 million rows/s., 604.62 MB/s.)

To implement validation filtering, we need two tables: one for all incoming data and one for clean data only. The materialized view's target table plays the role of the final table holding only clean data, while the source table is transient. We can delete data from the source table based on a TTL, as we did in the previous section, or change the engine of this table to Null, which does not store any data (the data will only be stored in the materialized view's target table):

CREATE TABLE wikistat_src
(
    `time` DateTime,
    `project` LowCardinality(String),
    `subproject` LowCardinality(String),
    `path` String,
    `hits` UInt64
)
ENGINE = Null

Now, let's create a materialized view using a data validation query:

CREATE TABLE wikistat_clean AS wikistat;

Ok.

CREATE MATERIALIZED VIEW wikistat_clean_mv TO wikistat_clean
AS SELECT *
FROM wikistat_src
WHERE match(path, '[a-z0-9\\-]')

When we insert data,  wikistat_src  will remain empty:

INSERT INTO wikistat_src SELECT * FROM s3('https://ClickHouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst') LIMIT 1000

Let's confirm that the source table is empty:

SELECT count(*)
FROM wikistat_src

┌─count()─┐
│       0 │
└─────────┘

Meanwhile, our  wikistat_clean  materialized target table contains only the valid rows:

SELECT count(*)
FROM wikistat_clean

┌─count()─┐
│      58 │
└─────────┘

The other 942 rows (1000 - 58) were excluded by our validation condition at insert time.

Routing data to tables

Another example where materialized views can be used is to route data to different tables based on certain conditions.


For example, we might want to route invalid data to another table instead of deleting it. In this case, we create another materialized view, but with a different query:

CREATE TABLE wikistat_invalid AS wikistat;

Ok.

CREATE MATERIALIZED VIEW wikistat_invalid_mv TO wikistat_invalid
AS SELECT *
FROM wikistat_src
WHERE NOT match(path, '[a-z0-9\\-]')

When we have multiple materialized views attached to the same source table, they are processed in alphabetical order. Remember not to create more than a few dozen materialized views per source table, as insert performance may degrade.

If we insert the same data again, we will find 942 invalid rows in the  wikistat_invalid  materialized view:

SELECT count(*)
FROM wikistat_invalid

┌─count()─┐
│     942 │
└─────────┘

Data transformation

Since materialized views are based on the results of queries, we can use the power of all ClickHouse SQL functions to transform source values, enriching the data and improving its clarity. As a quick example, let's merge the project, subproject, and path columns into a single page column, and split time into separate date and hour columns:

CREATE TABLE wikistat_human
(
    `date` Date,
    `hour` UInt8,
    `page` String,
    `hits` UInt64
)
ENGINE = MergeTree
ORDER BY (page, date);

Ok.

CREATE MATERIALIZED VIEW wikistat_human_mv TO wikistat_human
AS SELECT
    date(time) AS date,
    toHour(time) AS hour,
    concat(project, if(subproject != '', '/', ''), subproject, '/', path) AS page,
    hits
FROM wikistat

Now, wikistat_human  will be populated with the transformed data:

┌───────date─┬─hour─┬─page──────────────────────────┬─hits─┐
│ 2015-11-08 │    8 │ en/m/Angel_Muñoz_(politician) │    1 │
│ 2015-11-09 │    3 │ en/m/Angel_Muñoz_(politician) │    1 │
└────────────┴──────┴───────────────────────────────┴──────┘

Creating materialized views in a production environment

When source data arrives, the new data is automatically added to the target table of the materialized view. However, in order to populate a materialized view with existing data in a production environment, we have to follow these simple steps:

1. Pause writing to the source table.

2. Create a materialized view.

3. Populate the target table with data from the source table.

4. Resume writing to the source table.

Alternatively, we can use a future point in time when creating the materialized view:

CREATE MATERIALIZED VIEW mv TO target_table
AS SELECT …
FROM source_table WHERE date > `$todays_date`

where $todays_date  should be replaced with an absolute date. Our materialized view will then start firing tomorrow, so we wait until tomorrow and backfill the historical data with the following query:

INSERT INTO target_table
SELECT ...
FROM source_table WHERE date <= `$todays_date`

Materialized views and JOIN operations

Since materialized views work based on the results of SQL queries, we can use JOIN operations as well as any other SQL features. But JOIN operations should be used with caution.

Let's say we have a table with page titles:

CREATE TABLE wikistat_titles
(
    `path` String,
    `title` String
)
ENGINE = MergeTree
ORDER BY path

The titles in this table are associated with paths:

SELECT *
FROM wikistat_titles

┌─path─────────┬─title────────────────┐
│ Ana_Sayfa    │ Ana Sayfa - artist   │
│ Bruce_Jenner │ William Bruce Jenner │
└──────────────┴──────────────────────┘

Now we can create a materialized view that joins with the  wikistat_titles  table on the path column to pick up the title value:

CREATE TABLE wikistat_with_titles
(
    `time` DateTime,
    `path` String,
    `title` String,
    `hits` UInt64
)
ENGINE = MergeTree
ORDER BY (path, time);

Ok.

CREATE MATERIALIZED VIEW wikistat_with_titles_mv TO wikistat_with_titles
AS SELECT time, path, title, hits
FROM wikistat AS w
INNER JOIN wikistat_titles AS wt ON w.path = wt.path

Note that we used  INNER JOIN , so after populating, the target table will contain only rows that have a matching path value in the  wikistat_titles  table:

SELECT * FROM wikistat_with_titles LIMIT 5

┌────────────────time─┬─path──────┬─title──────────────┬─hits─┐
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │    5 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │    7 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │    1 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │    3 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │  653 │
└─────────────────────┴───────────┴────────────────────┴──────┘

We insert a new record into the  wikistat  table to see how our new materialized view works:

INSERT INTO wikistat VALUES(now(), 'en', '', 'Ana_Sayfa', 123);

1 row in set. Elapsed: 1.538 sec.

Note the insertion time here: 1.538 seconds. We can see our new row in  wikistat_with_titles :

SELECT *
FROM wikistat_with_titles
ORDER BY time DESC
LIMIT 3

┌────────────────time─┬─path─────────┬─title────────────────┬─hits─┐
│ 2023-01-03 08:43:14 │ Ana_Sayfa    │ Ana Sayfa - artist   │  123 │
│ 2015-06-30 23:00:00 │ Bruce_Jenner │ William Bruce Jenner │  115 │
│ 2015-06-30 23:00:00 │ Bruce_Jenner │ William Bruce Jenner │   55 │
└─────────────────────┴──────────────┴──────────────────────┴──────┘

But what happens if we add data to the  wikistat_titles  table?

INSERT INTO wikistat_titles
VALUES('Academy_Awards', 'Oscar academy awards');

Even though the  wikistat  table has corresponding values, nothing will appear in the materialized view:

SELECT *
FROM wikistat_with_titles
WHERE path = 'Academy_Awards'

0 rows in set. Elapsed: 0.003 sec.

This is because the materialized view only fires when its source table receives an insert. It is just a trigger on the source table and knows nothing about the joined table. Note that this applies not only to JOIN queries; it is relevant whenever the materialized view's SELECT references an external table, for example with  IN (SELECT ...) .

In our case,  wikistat  is the source table of the materialized view, and  wikistat_titles  is the table we join to.


That's why nothing appeared in our materialized view: nothing was inserted into the  wikistat  table. Let's insert something into it:

INSERT INTO wikistat VALUES(now(), 'en', '', 'Academy_Awards', 456);

We can see the new record in the materialized view:

SELECT *
FROM wikistat_with_titles
WHERE path = 'Academy_Awards'

┌────────────────time─┬─path───────────┬─title────────────────┬─hits─┐
│ 2023-01-03 08:56:50 │ Academy_Awards │ Oscar academy awards │  456 │
└─────────────────────┴────────────────┴──────────────────────┴──────┘

Be careful as JOIN operations may significantly reduce insert performance when joining large tables, as shown above. Consider using a dictionary as a more efficient alternative.
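As a sketch of that alternative (the dictionary name, layout, and lifetime below are illustrative), the titles table could be exposed as a dictionary and looked up with dictGet inside the materialized view, avoiding a join at insert time:

```sql
CREATE DICTIONARY wikistat_titles_dict
(
    `path` String,
    `title` String
)
PRIMARY KEY path
SOURCE(CLICKHOUSE(TABLE 'wikistat_titles'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 60 MAX 120);

-- Look up the title instead of joining;
-- complex-key dictionaries take the key as a tuple
CREATE MATERIALIZED VIEW wikistat_with_titles_dict_mv TO wikistat_with_titles
AS SELECT
    time,
    path,
    dictGet('wikistat_titles_dict', 'title', tuple(path)) AS title,
    hits
FROM wikistat
```

Note that dictionary lookups happen at insert time too, so title changes made after a row is inserted still won't be reflected in already-materialized rows.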

Summary

In this blog post, we explored how materialized views can be a powerful tool in ClickHouse for improving query performance and expanding data management capabilities. You can even use materialized views with JOIN operations. Consider materialized columns as a quick alternative when aggregation or filtering is not required.




Origin blog.csdn.net/ClickHouseDB/article/details/132878884