Author: Denys Golotiuk
Reviewer: Zhuang Xiaodong (Weizhuang)
Introduction
In the real world, data not only needs to be stored but also processed. Processing is usually done on the application side. However, some key processing points can be moved to ClickHouse to improve data performance and manageability. One of the most powerful tools in ClickHouse is materialized views. In this article, we'll explore materialized views and how they accomplish tasks such as accelerating queries as well as data transformation, filtering, and routing.
If you would like to learn more about materialized views, we offer a free training course.
What is a materialized view?
A materialized view is a special kind of trigger: when data is inserted into a source table, it runs a SELECT query over that data and stores the result in a target table:
This is useful in many scenarios, let's look at the most popular one - making certain queries faster.
Quick example
Take Wikistat’s 1 billion row data set as an example:
CREATE TABLE wikistat
(
`time` DateTime CODEC(Delta(4), ZSTD(1)),
`project` LowCardinality(String),
`subproject` LowCardinality(String),
`path` String,
`hits` UInt64
)
ENGINE = MergeTree
ORDER BY (path, time);
Ok.
INSERT INTO wikistat SELECT *
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst') LIMIT 1e9
Suppose we frequently query the most popular items on a certain date:
SELECT
project,
sum(hits) AS h
FROM wikistat
WHERE date(time) = '2015-05-01'
GROUP BY project
ORDER BY h DESC
LIMIT 10
This query takes 15 seconds to complete on the test instance:
┌─project─┬────────h─┐
│ en │ 34521803 │
│ es │ 4491590 │
│ de │ 4490097 │
│ fr │ 3390573 │
│ it │ 2015989 │
│ ja │ 1379148 │
│ pt │ 1259443 │
│ tr │ 1254182 │
│ zh │ 988780 │
│ pl │ 985607 │
└─────────┴──────────┘
10 rows in set. Elapsed: 14.869 sec. Processed 972.80 million rows, 10.53 GB (65.43 million rows/s., 708.05 MB/s.)
If we have a large number of queries like this and we need millisecond performance from ClickHouse, we can create a materialized view for this query:
CREATE TABLE wikistat_top_projects
(
`date` Date,
`project` LowCardinality(String),
`hits` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (date, project);
Ok.
CREATE MATERIALIZED VIEW wikistat_top_projects_mv TO wikistat_top_projects AS
SELECT
date(time) AS date,
project,
sum(hits) AS hits
FROM wikistat
GROUP BY
date,
project;
In these two queries:
- wikistat_top_projects is the target table that stores the materialized view's data,
- wikistat_top_projects_mv is the materialized view itself (the trigger),
- we used the SummingMergeTree engine because we want the hits values summed per date/project pair,
- the query after AS defines what the materialized view computes.
We can create any number of materialized views, but each new one adds storage overhead, so keep the total reasonable - ideally fewer than ten materialized views per source table.
Now we populate the materialized view's target table with the existing data from the wikistat table, using the same query as the view:
INSERT INTO wikistat_top_projects SELECT
date(time) AS date,
project,
sum(hits) AS hits
FROM wikistat
GROUP BY
date,
project
Query materialized view table
Since wikistat_top_projects is a regular table, we can query it with ordinary SQL:
SELECT
project,
sum(hits) hits
FROM wikistat_top_projects
WHERE date = '2015-05-01'
GROUP BY project
ORDER BY hits DESC
LIMIT 10
┌─project─┬─────hits─┐
│ en │ 34521803 │
│ es │ 4491590 │
│ de │ 4490097 │
│ fr │ 3390573 │
│ it │ 2015989 │
│ ja │ 1379148 │
│ pt │ 1259443 │
│ tr │ 1254182 │
│ zh │ 988780 │
│ pl │ 985607 │
└─────────┴──────────┘
10 rows in set. Elapsed: 0.003 sec. Processed 8.19 thousand rows, 101.81 KB (2.83 million rows/s., 35.20 MB/s.)
Note that it only took ClickHouse 3ms to produce the same result, whereas the original query took 15 seconds. Also note that because the SummingMergeTree engine merges rows asynchronously in the background (which saves resources and reduces the impact on query processing), some values may not have been summed yet, so we still need to use GROUP BY in the query.
Manage materialized views
We can use SHOW TABLES query to list materialized views:
SHOW TABLES LIKE 'wikistat_top_projects_mv'
┌─name─────────────────────┐
│ wikistat_top_projects_mv │
└──────────────────────────┘
We can delete the materialized view using DROP TABLE , but this will only delete the trigger itself:
DROP TABLE wikistat_top_projects_mv
If the target table is no longer needed, remember to delete it as well:
DROP TABLE wikistat_top_projects
Get the size of the materialized view on disk
All metadata about the materialized view table is stored in the system database, like other tables. For example, to get its size on disk, we can do the following:
SELECT
rows,
formatReadableSize(total_bytes) AS total_bytes_on_disk
FROM system.tables
WHERE table = 'wikistat_top_projects'
┌──rows─┬─total_bytes_on_disk─┐
│ 15336 │ 37.42 KiB │
└───────┴─────────────────────┘
Update data in materialized view
The most powerful feature of materialized views is that when data is inserted into the source table, the target table is automatically updated with the result of the SELECT query:
Therefore, we do not need to additionally refresh the data in the materialized view - ClickHouse does everything automatically. Suppose we insert new data into the wikistat table:
INSERT INTO wikistat
VALUES(now(), 'test', '', '', 10),
(now(), 'test', '', '', 10),
(now(), 'test', '', '', 20),
(now(), 'test', '', '', 30);
Now, let us query the target table of the materialized view to verify that the hits column has been summarized correctly. We use the FINAL modifier to ensure that the SummingMergeTree engine returns summarized hits rather than individual, unmerged rows:
SELECT hits
FROM wikistat_top_projects
FINAL
WHERE (project = 'test') AND (date = date(now()))
┌─hits─┐
│ 70 │
└──────┘
1 row in set. Elapsed: 0.005 sec. Processed 7.15 thousand rows, 89.37 KB (1.37 million rows/s., 17.13 MB/s.)
In a production environment, avoid using FINAL on large tables; prefer aggregating with sum(hits) and GROUP BY instead. Also check the optimize_on_insert setting, which controls whether inserted data is merged.
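As a sketch of that recommendation, the FINAL query above can be rewritten to aggregate at query time, so that any rows SummingMergeTree has not yet merged are combined by GROUP BY:

```sql
-- Merge-safe alternative to FINAL: sum() + GROUP BY combines
-- any rows the background merge hasn't summed yet.
SELECT sum(hits) AS hits
FROM wikistat_top_projects
WHERE (project = 'test') AND (date = date(now()))
GROUP BY project, date
```

This returns the same total (70 in our example) without forcing an expensive on-the-fly merge of the whole table.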
Accelerate aggregation using materialized views
As shown in the previous section, materialized views are a way to improve query performance. For analytical queries, common aggregation operations are not just sum() as shown in the previous example. SummingMergeTree is great for calculating summary data, but there are more advanced aggregations that can be calculated using the AggregatingMergeTree engine.
Suppose we frequently execute the following types of queries:
SELECT
toDate(time) AS date,
min(hits) AS min_hits_per_hour,
max(hits) AS max_hits_per_hour,
avg(hits) AS avg_hits_per_hour
FROM wikistat
WHERE project = 'en'
GROUP BY date
This gives us the daily minimum, maximum, and average of hourly hits for a given project:
┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬──avg_hits_per_hour─┐
│ 2015-05-01 │ 1 │ 36802 │ 4.586310181621408 │
│ 2015-05-02 │ 1 │ 23331 │ 4.241388590780171 │
│ 2015-05-03 │ 1 │ 24678 │ 4.317835245126423 │
...
└────────────┴───────────────────┴───────────────────┴────────────────────┘
38 rows in set. Elapsed: 8.970 sec. Processed 994.11 million rows
Note that our raw data has been aggregated by hour.
We can use a materialized view to store these aggregated results for faster retrieval, defining the aggregates with state combinators. A state combinator tells ClickHouse to save the internal aggregation state rather than the final result, which allows aggregation without keeping every original row. The approach is simple: use the *State() functions when creating the materialized view, and the corresponding *Merge() functions at query time to obtain the final results:
In our example we will use the min, max and avg states. In the target table of the new materialized view, we use the AggregateFunction type to store aggregate states instead of plain values:
CREATE TABLE wikistat_daily_summary
(
`project` String,
`date` Date,
`min_hits_per_hour` AggregateFunction(min, UInt64),
`max_hits_per_hour` AggregateFunction(max, UInt64),
`avg_hits_per_hour` AggregateFunction(avg, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (project, date);
Ok.
CREATE MATERIALIZED VIEW wikistat_daily_summary_mv
TO wikistat_daily_summary AS
SELECT
project,
toDate(time) AS date,
minState(hits) AS min_hits_per_hour,
maxState(hits) AS max_hits_per_hour,
avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY project, date
Now, let's fill it with data:
INSERT INTO wikistat_daily_summary SELECT
project,
toDate(time) AS date,
minState(hits) AS min_hits_per_hour,
maxState(hits) AS max_hits_per_hour,
avgState(hits) AS avg_hits_per_hour
FROM wikistat
GROUP BY project, date
0 rows in set. Elapsed: 33.685 sec. Processed 994.11 million rows
At query time, we use the corresponding Merge combinator to retrieve the value:
SELECT
date,
minMerge(min_hits_per_hour) min_hits_per_hour,
maxMerge(max_hits_per_hour) max_hits_per_hour,
avgMerge(avg_hits_per_hour) avg_hits_per_hour
FROM wikistat_daily_summary
WHERE project = 'en'
GROUP BY date
Note that we get exactly the same result, but thousands of times faster:
┌───────date─┬─min_hits_per_hour─┬─max_hits_per_hour─┬──avg_hits_per_hour─┐
│ 2015-05-01 │ 1 │ 36802 │ 4.586310181621408 │
│ 2015-05-02 │ 1 │ 23331 │ 4.241388590780171 │
│ 2015-05-03 │ 1 │ 24678 │ 4.317835245126423 │
...
└────────────┴───────────────────┴───────────────────┴────────────────────┘
32 rows in set. Elapsed: 0.005 sec. Processed 9.54 thousand rows, 1.14 MB (1.76 million rows/s., 209.01 MB/s.)
Any aggregate function can be used with the State/Merge combinators as part of an aggregating materialized view.
Compressing data to optimize storage
In some cases we only need to keep aggregated data, although the data arrives as individual events. If we still need the raw data for the last few days but can keep only aggregates for history, we can achieve this by combining a materialized view with a TTL on the source table.
In order to optimize storage space, we can also explicitly declare column types to ensure that the table structure is optimal. Suppose we want to store only monthly aggregated data for each path from the wikistat table:
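The target table for this monthly view is not shown in the article; a minimal definition that fits the query below (an assumed sketch, since the original omits it) could be:

```sql
-- Assumed target table for wikistat_monthly_mv:
-- SummingMergeTree sums hits per (path, month) pair.
CREATE TABLE wikistat_monthly
(
`month` Date,
`path` String,
`hits` UInt64
)
ENGINE = SummingMergeTree
ORDER BY (path, month);
```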
CREATE MATERIALIZED VIEW wikistat_monthly_mv TO
wikistat_monthly AS
SELECT
toDate(toStartOfMonth(time)) AS month,
path,
sum(hits) AS hits
FROM wikistat
GROUP BY
path,
month
The original table (where data is stored hourly) uses over three times the disk space of the aggregated materialized view:

| wikistat (original table) | wikistat_monthly (materialized view) |
|---|---|
| 1.78 GiB | 565.68 MiB |
| 1b rows | ~27m rows |
One point to note: this kind of aggregation only pays off when it reduces the row count by at least a factor of 10. Otherwise, ClickHouse's powerful compression and encoding algorithms already achieve comparable storage efficiency without any aggregation.
Now that we have the monthly aggregation, we can add a TTL expression to the original table so that the data is deleted after 1 week:
ALTER TABLE wikistat MODIFY TTL time + INTERVAL 1 WEEK
Validate and filter data
Another popular example of using materialized views is to process data immediately after insertion. Data validation is a good example.
Suppose we want to filter out all paths containing unwanted symbols and save them in the result table. Our table has about 1% of values like this:
SELECT count(*)
FROM wikistat
WHERE NOT match(path, '[a-z0-9\\-]')
┌──count()─┐
│ 12168918 │
└──────────┘
1 row in set. Elapsed: 46.324 sec. Processed 994.11 million rows, 28.01 GB (21.46 million rows/s., 604.62 MB/s.)
To implement validation filtering, we need two tables: one holding all data, and one holding only clean data. The materialized view's target table plays the role of the final table with only clean data, while the source table is transient. We can either delete data from the source table based on a TTL, as in the previous section, or change its engine to Null, which stores no data at all (the data will then live only in the materialized view):
CREATE TABLE wikistat_src
(
`time` DateTime,
`project` LowCardinality(String),
`subproject` LowCardinality(String),
`path` String,
`hits` UInt64
)
ENGINE = Null
Now, let's create a materialized view using a data validation query:
CREATE TABLE wikistat_clean AS wikistat;
Ok.
CREATE MATERIALIZED VIEW wikistat_clean_mv TO wikistat_clean
AS SELECT *
FROM wikistat_src
WHERE match(path, '[a-z0-9\\-]')
When we insert data, wikistat_src will remain empty:
INSERT INTO wikistat_src SELECT * FROM s3('https://ClickHouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst') LIMIT 1000
Let's make sure the original table is empty:
SELECT count(*)
FROM wikistat_src
┌─count()─┐
│ 0 │
└─────────┘
However, our wikistat_clean materialized table now has only valid rows:
SELECT count(*)
FROM wikistat_clean
┌─count()─┐
│ 58 │
└─────────┘
The other 942 rows (1000 - 58) were excluded by our validation statement when inserting.
Routing data to tables
Another example where materialized views can be used is to route data to different tables based on certain conditions:
For example, we might want to route invalid data to another table instead of deleting it. In this case, we create another materialized view, but with a different query:
CREATE TABLE wikistat_invalid AS wikistat;
Ok.
CREATE MATERIALIZED VIEW wikistat_invalid_mv TO wikistat_invalid
AS SELECT *
FROM wikistat_src
WHERE NOT match(path, '[a-z0-9\\-]')
When multiple materialized views are attached to the same source table, they are processed in alphabetical order. Remember not to create more than a few dozen materialized views per source table, as insert performance may degrade.
If we insert the same data again, we will find 942 invalid rows in the wikistat_invalid materialized view:
SELECT count(*)
FROM wikistat_invalid
┌─count()─┐
│ 942 │
└─────────┘
Data transformation
Since materialized views are based on the result of a query, we can use the full power of ClickHouse functions in SQL to transform source values, enriching the data and improving its clarity. As a quick example, let's merge the project, subproject and path columns into a single page column, and split time into date and hour columns:
CREATE TABLE wikistat_human
(
`date` Date,
`hour` UInt8,
`page` String,
`hits` UInt64
)
ENGINE = MergeTree
ORDER BY (page, date);
Ok.
CREATE MATERIALIZED VIEW wikistat_human_mv TO wikistat_human
AS SELECT
date(time) AS date,
toHour(time) AS hour,
concat(project, if(subproject != '', '/', ''), subproject, '/', path) AS page,
hits
FROM wikistat
Now, wikistat_human will be populated with the transformed data:
┌───────date─┬─hour─┬─page──────────────────────────┬─hits─┐
│ 2015-11-08 │ 8 │ en/m/Angel_Muñoz_(politician) │ 1 │
│ 2015-11-09 │ 3 │ en/m/Angel_Muñoz_(politician) │ 1 │
└────────────┴──────┴───────────────────────────────┴──────┘
Create materialized views in production environment
When source data arrives, the new data is automatically added to the target table of the materialized view. However, in order to populate a materialized view with existing data in a production environment, we have to follow these simple steps:
1. Pause writing to the source table.
2. Create a materialized view.
3. Populate the target table with data from the source table.
4. Resume writing to the source table.
Alternatively, we can use a future point in time when creating the materialized view:
CREATE MATERIALIZED VIEW mv TO target_table
AS SELECT …
FROM source_table WHERE date > `$todays_date`
where $todays_date should be replaced with an absolute date. Therefore, our materialized view will fire starting tomorrow, so we have to wait until tomorrow and populate the historical data with the following query:
INSERT INTO target_table
SELECT ...
FROM source_table WHERE date <= `$todays_date`
Materialized views and JOIN operations
Since materialized views work based on the results of SQL queries, we can use JOIN operations as well as any other SQL features. But JOIN operations should be used with caution.
Let's say we have a table with page titles:
CREATE TABLE wikistat_titles
(
`path` String,
`title` String
)
ENGINE = MergeTree
ORDER BY path
The titles in this table are associated with paths:
SELECT *
FROM wikistat_titles
┌─path─────────┬─title────────────────┐
│ Ana_Sayfa │ Ana Sayfa - artist │
│ Bruce_Jenner │ William Bruce Jenner │
└──────────────┴──────────────────────┘
Now we can create a materialized view that joins wikistat with the wikistat_titles table on the path column to pull in the title values:
CREATE TABLE wikistat_with_titles
(
`time` DateTime,
`path` String,
`title` String,
`hits` UInt64
)
ENGINE = MergeTree
ORDER BY (path, time);
Ok.
CREATE MATERIALIZED VIEW wikistat_with_titles_mv TO wikistat_with_titles
AS SELECT time, path, title, hits
FROM wikistat AS w
INNER JOIN wikistat_titles AS wt ON w.path = wt.path
Note that we used INNER JOIN, so after populating, the target table will contain only rows whose path has a matching record in wikistat_titles:
SELECT * FROM wikistat_with_titles LIMIT 5
┌────────────────time─┬─path──────┬─title──────────────┬─hits─┐
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │ 5 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │ 7 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │ 1 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │ 3 │
│ 2015-05-01 01:00:00 │ Ana_Sayfa │ Ana Sayfa - artist │ 653 │
└─────────────────────┴───────────┴────────────────────┴──────┘
We insert a new record into the wikistat table to see how our new materialized view works:
INSERT INTO wikistat VALUES(now(), 'en', '', 'Ana_Sayfa', 123);
1 row in set. Elapsed: 1.538 sec.
Note the insert time here: 1.538 seconds. We can see our new row in wikistat_with_titles:
SELECT *
FROM wikistat_with_titles
ORDER BY time DESC
LIMIT 3
┌────────────────time─┬─path─────────┬─title────────────────┬─hits─┐
│ 2023-01-03 08:43:14 │ Ana_Sayfa │ Ana Sayfa - artist │ 123 │
│ 2015-06-30 23:00:00 │ Bruce_Jenner │ William Bruce Jenner │ 115 │
│ 2015-06-30 23:00:00 │ Bruce_Jenner │ William Bruce Jenner │ 55 │
└─────────────────────┴──────────────┴──────────────────────┴──────┘
But what happens if we add data to the wikistat_titles table instead?
INSERT INTO wikistat_titles
VALUES('Academy_Awards', 'Oscar academy awards');
Even though we have the corresponding values in the wikistat table, nothing will appear in the materialized view:
SELECT *
FROM wikistat_with_titles
WHERE path = 'Academy_Awards'
0 rows in set. Elapsed: 0.003 sec.
This is because a materialized view only fires when its source table receives inserts. It is just a trigger on the source table and knows nothing about the joined table. Note that this applies not only to JOIN queries, but to any external table referenced in the materialized view's SELECT statement, for example inside an IN clause.
In our case, wikistat is the source table of the materialized view, and wikistat_titles is the table we join to:
That's why nothing appears in our materialized view - nothing is inserted into the wikistat table. But let's insert some content into it:
INSERT INTO wikistat VALUES(now(), 'en', '', 'Academy_Awards', 456);
We can see the new record in the materialized view:
SELECT *
FROM wikistat_with_titles
WHERE path = 'Academy_Awards'
┌────────────────time─┬─path───────────┬─title────────────────┬─hits─┐
│ 2023-01-03 08:56:50 │ Academy_Awards │ Oscar academy awards │ 456 │
└─────────────────────┴────────────────┴──────────────────────┴──────┘
Be careful as JOIN operations may significantly reduce insert performance when joining large tables, as shown above. Consider using a dictionary as a more efficient alternative.
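As a sketch of the dictionary alternative (the dictionary name, layout, and lifetime values here are assumptions, not from the original), we could back a dictionary with the wikistat_titles table and replace the JOIN with an in-memory lookup:

```sql
-- A dictionary backed by wikistat_titles; lookups replace the JOIN at insert time.
CREATE DICTIONARY wikistat_titles_dict
(
`path` String,
`title` String
)
PRIMARY KEY path
SOURCE(CLICKHOUSE(TABLE 'wikistat_titles'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 60 MAX 300);

-- The materialized view then uses dictGet/dictHas instead of INNER JOIN
-- (complex-key layouts take the key as a tuple):
CREATE MATERIALIZED VIEW wikistat_with_titles_dict_mv TO wikistat_with_titles
AS SELECT
time,
path,
dictGet('wikistat_titles_dict', 'title', tuple(path)) AS title,
hits
FROM wikistat
WHERE dictHas('wikistat_titles_dict', tuple(path));
```

Dictionary lookups are hash-based and avoid rescanning the joined table on every insert, which keeps insert latency low even as wikistat_titles grows.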
Summary
In this blog post, we explored how materialized views are a powerful tool in ClickHouse for improving query performance and extending data management capabilities. They can even be used with JOIN operations. When no aggregation or filtering is required, consider materialized columns as a quick alternative.