Real-time materialized views: a powerful tool for accelerating large-scale time series data queries

This article is about 5,000 words and takes roughly 10 minutes to read.
It shares how real-time materialized views are implemented in our product, focusing on accelerating queries over large-scale time series data.

Contents:

  1. Why do we need materialized views

  2. What is a real-time materialized view

  3. How to implement a real-time materialized view

  4. Outlook and Summary

01. Why do we need a materialized view

In our daily lives, we generate enormous amounts of data. According to statistics, in 2020 humans generated about 2.5 EB (2.5 × 10^18 bytes) of data every day, and this figure is estimated to reach 463 EB per day by 2025, an impressive growth rate. As the scale of data keeps expanding, analytical queries become more complex and time-consuming, and query acceleration becomes a key task in analytics.

Commonly used methods for accelerating analytical queries include the following:

  • Caching: copy data from slow storage media to faster ones such as memory, so that reads respond more quickly during analysis.

  • Parallel and distributed computing: decompose a computing task into multiple subtasks processed in parallel, making full use of computing resources to improve query speed and efficiency.

  • Data partitioning and indexing: reduce the amount of data that must be scanned during a query, thereby accelerating it.

  • Precomputation: calculate and aggregate data in advance and materialize the results, so that queries can use the precomputed results directly. Materialized views are an important implementation of precomputation.

02. What is a real-time materialized view

A materialized view is a precomputed result set for common time-consuming or complex queries, so that these precomputed results can be accessed quickly at query time. The left table in the figure below is a product order table containing simple sales information. Traditionally, every time the total sales of each product need to be calculated, the data must be read from the original table on the left and aggregated, which is time-consuming and consumes system resources. With a materialized view, we precompute the data in advance and store the result as the table on the right. Subsequent queries simply read from the materialized view without repeating the expensive computation, saving substantial time and resources and greatly improving query efficiency and response speed.
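
As a minimal sketch of the difference (table and column names here are illustrative, not taken from the figure), the two approaches look roughly like this:

```sql
-- Without a materialized view: every request rescans and re-aggregates the raw orders.
SELECT product, SUM(price * quantity) AS total_sales
FROM orders
GROUP BY product;

-- With a materialized view: the precomputed result table is read back directly.
SELECT product, total_sales
FROM product_sales_mv;
```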

In real scenarios, however, the amount of data keeps growing over time. Querying the total sales of a product at different times yields different results, so the stored materialized data must also be kept up to date. Recomputing the full materialization over time series data is expensive, and the materialized results are invalidated as soon as new data arrives. Fortunately, time series data is usually append-only: existing data rarely changes. Historical materialized results can therefore be reused, and when new data arrives, only an incremental update of the existing precomputed data is needed.

A real-time materialized view is a precomputation-based acceleration mechanism suited to time series data. Its core idea is to preserve historical materialized data and computation results and to update them incrementally when new data arrives, reducing computation overhead while keeping the materialization easy to manage and maintain.

There are three key points in realizing a real-time materialized view over time series data:

  • Storage: define the structure of the materialized table and store the materialized data, choosing storage media and data structures that satisfy both query performance and storage requirements. With a reasonable storage design and suitable indexing and compression, the materialized data can be accessed and used efficiently (a sketch of such a table follows this list).

  • Updates: a real-time materialized view needs periodic or event-driven incremental updates. Periodic updating refreshes the materialized data at a fixed interval, such as hourly, daily, or weekly; event-driven updating triggers a refresh in response to data change events. Either way, the materialized data stays consistent with the original data, ensuring accurate query results.

  • Precomputation: precomputation is the core step of a real-time materialized view, and it must be expressible as an incremental update.
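
As a rough sketch of the storage side (the actual Yanhuang table layout is not shown here, so all names and types below are assumptions), a materialized table for partial aggregates could simply hold, per raw data shard, the intermediate values needed to finish the aggregation later:

```sql
-- Hypothetical layout of a materialized table holding partial aggregates
-- (one row per raw data shard; a time bucket column is added later in the article).
CREATE TABLE mv_partials (
    shard_id    BIGINT,            -- raw data shard the row was computed from
    price_sum   DOUBLE PRECISION,  -- partial SUM(price) over that shard
    price_count BIGINT             -- partial COUNT(price) over that shard
);
```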

03. How to implement a real-time materialized view

The Yanhuang product itself is an analysis platform for observability data. Observability data is naturally timestamped, and analysis queries always carry a query time window. When designing the real-time materialized view for Yanhuang, the goal was that queries against the materialized view should adapt to any time window and still return accurate results. The implementation below is built around this objective.

Raw time series data that grows over time is usually split into small data blocks (shards) for storage. Sharding rules are typically based on the import time range, the accumulated data volume, or other tag information. Precomputation is performed on each raw data shard to obtain a corresponding materialized data shard; when new data arrives, only the new data needs to be precomputed.

After obtaining the precomputed materialized shard for each raw data shard, these materialized shards are merged and re-aggregated at query time. The process is similar to the familiar MapReduce framework: in the Map phase, data is split into small blocks and distributed to different compute nodes, each producing a partial aggregation result; in the Reduce phase, the partial results are combined and aggregated again to produce the final result returned to the requester. In the real-time materialized view, the precomputation of each raw data shard corresponds to the Map operation: each shard is precomputed independently into a partial aggregation result. At query time these materialized shards are merged and aggregated, which corresponds to the Reduce operation.
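
In SQL terms, a sketch of the two phases might look as follows (shard and table names reuse the illustrative mv_partials layout above):

```sql
-- "Map": each raw data shard is precomputed independently into a partial result.
INSERT INTO mv_partials (shard_id, price_sum, price_count)
SELECT 3, SUM(price), COUNT(price)
FROM raw_orders_shard_3;              -- illustrative per-shard source table

-- "Reduce": at query time the partial results of all shards are merged again.
SELECT SUM(price_sum) / SUM(price_count) AS avg_price
FROM mv_partials;
```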

Note that the timing of precomputing each data shard is flexible: precomputation can be performed separately and on demand as data grows over time. Different shards may be precomputed at different points in time, driven dynamically by data changes and updates.

Take the average price as an example. As shown in the figure above, to compute the average price from the original data table ①, we first compute the sum and the count and store them as the intermediate precomputed results ②. In the same way, the partial aggregation for each raw time series data shard is computed over its portion of the data. At query time, it suffices to read the partial aggregation results from table ② and combine them to obtain the average in table ③.

At query time, the query range usually does not cover the complete time range of a materialized shard; it may cover only part of a shard's time range, or include data that is still being imported or has not yet been materialized. To adapt to queries over arbitrary time ranges, we want the precomputed materialized data to retain some temporal resolution. The product order data in ① naturally carries timestamps, but the partial aggregation result in ② is a very thin table of sum and count values that has lost all time information. We therefore bucket the precomputed data by time: for example, bucketing the results in ② by one hour yields the intermediate results in the green table at the bottom right. These intermediate results carry time information and contain the partial aggregation results of the data within each hourly bucket.

Abstracting the AVG computation above into an expression: the SUM and COUNT values are retained during precomputation, and when the query result is finally requested, the sums are added up and divided by the total count, giving the final result SUM(sum)/SUM(count). On this basis, time-bucketing the materialized data is equivalent to adding a GROUP BY on the time bucket in the precomputation step, as shown in the figure below.
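
Continuing with the illustrative names used above, adding the time bucket amounts to a GROUP BY on a bucketed timestamp during precomputation, and the final query sums only the buckets that fall inside its window (date_trunc is used here as a generic stand-in for the product's bucketing function; mv_partials_bucketed is mv_partials plus a bucket_start column):

```sql
-- Precomputation per shard, now grouped into hourly time buckets.
INSERT INTO mv_partials_bucketed (shard_id, bucket_start, price_sum, price_count)
SELECT 3, date_trunc('hour', ts), SUM(price), COUNT(price)
FROM raw_orders_shard_3
GROUP BY date_trunc('hour', ts);

-- Final result over a set of complete buckets: SUM(sum) / SUM(count).
SELECT SUM(price_sum) / SUM(price_count) AS avg_price
FROM mv_partials_bucketed
WHERE bucket_start >= TIMESTAMP '2023-07-01 00:00:00'
  AND bucket_start <  TIMESTAMP '2023-07-02 00:00:00';
```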

So far, we have obtained a precomputed materialized result shard for each raw data shard, and each materialized shard carries time bucket information. When the materialized view is queried, the query time window may cover the full range of some materialized shards as well as the time range of data that is still being imported or has not yet been materialized. As shown in the figure above, the time window is divided according to whether the data has been precomputed and materialized. The time range covered by materialized shards 4-7 can be answered entirely from the stored precomputed results; this is the first part, P1. The data being imported on the right has not yet been finalized as a shard, so the raw data must be read directly to take part in the calculation; this is the second part, P2. The remaining range on the left covers only part of materialized shard 3. Because the data in shard 3 has already been partially aggregated, the exact original time information is lost, so its partial aggregation results cannot all be used directly. We therefore split this range by time bucket: the portion that falls within complete time buckets can be read from the materialized shard's results, forming the third part, P3; the portion that does not fall within a complete time bucket must be read from the raw data shard, forming the fourth part, P4. P1 and P3 usually account for most of the query time range, so precomputing and accelerating this portion accelerates the whole query.

To make the division between P3 and P4 concrete, suppose the query starts at 11:22, the time bucket size is 1 hour (1h), and materialized shard 3 spans the buckets from 10:00 to 14:00. Using 12:00 as the dividing line, the range from 12:00 to 14:00 falls within complete time buckets and belongs to P3, while 11:22-12:00 does not cover a complete bucket and must be read from the raw data, belonging to P4.

Above, we divided the query time range into four parts. The first part covers time ranges whose materialized shards are fully included in the query window. The second part covers data that has not been materialized, either because it is still being imported or because it has not yet been precomputed since the materialized view was created. The third part covers materialized data from which only the complete time buckets within the query window are taken. The fourth part has also been materialized, but because part of the time information has been lost in bucketing, its bucketed partial aggregation results cannot be used directly. Based on this division, the data in P2 and P4 must go through the precomputation-related operations at query time, for example computing the partial SUM and COUNT needed for AVG. The data in P1 and P3 has already been aggregated, so its partial results can be read directly from the materialized shards. All the SUM and COUNT values from the four parts are then aggregated again to produce the final query result. In this way, queries against the materialized view adapt to any time window while result consistency is guaranteed, as sketched below.
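
Continuing the 11:22 example with the illustrative names above (all dates are made up for illustration), the four parts could be combined roughly like this: P1 and P3 are read from the bucketed partial aggregates, while P2 and P4 compute their partial SUM and COUNT from the raw data on the fly, and everything is merged at the end:

```sql
SELECT SUM(s) / SUM(c) AS avg_price
FROM (
    -- P1 + P3: complete buckets answered from the materialized partials.
    SELECT price_sum AS s, price_count AS c
    FROM mv_partials_bucketed
    WHERE bucket_start >= TIMESTAMP '2023-07-01 12:00:00'
      AND bucket_start <  TIMESTAMP '2023-07-03 00:00:00'
    UNION ALL
    -- P4: partial bucket at the head of the window, computed from raw data.
    SELECT SUM(price), COUNT(price)
    FROM raw_orders
    WHERE ts >= TIMESTAMP '2023-07-01 11:22:00'
      AND ts <  TIMESTAMP '2023-07-01 12:00:00'
    UNION ALL
    -- P2: data not yet materialized (still being imported), also from raw data.
    SELECT SUM(price), COUNT(price)
    FROM raw_orders
    WHERE ts >= TIMESTAMP '2023-07-03 00:00:00'
      AND ts <  TIMESTAMP '2023-07-03 08:15:00'
) parts;
```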

Yanhuang provides a SQL interface for query and analysis, and defines a DDL for creating materialized views based on the implementation above. As shown in the figure above, the WITH clause of the DDL exposes two parameters for creating a materialized view. The unmaterialized portions of the four time windows described above must be read from the raw data; when the degree of materialization is high, this portion is very small, and when the requester does not need exact results it can often be ignored. Specifying MATERIALIZED_ONLY=true in the DDL skips this portion at query time to speed up analysis further. TIME_BUCKET configures the time bucket described earlier; it should be chosen according to the time ranges of common queries and the time distribution of the data. In addition, if past data does not need to be covered when the materialized view is created, WITH NO DATA can be specified so that no background precomputation is performed on historical data. Yanhuang also provides a SHOW MATERIALIZED VIEW statement to display the basic information and status of a materialized view. An illustrative sketch of such a DDL is shown below.
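
The exact DDL appears in the screenshot; purely as an illustrative sketch (the real Yanhuang syntax may differ), a creation statement combining these options might look like:

```sql
-- Illustrative sketch only, not the exact Yanhuang syntax.
CREATE MATERIALIZED VIEW avg_price_mv
WITH (TIME_BUCKET = '1h', MATERIALIZED_ONLY = true)
AS
SELECT AVG(price) AS avg_price
FROM orders
WITH NO DATA;   -- skip background precomputation of historical data

-- Inspect basic information and materialization progress.
SHOW MATERIALIZED VIEW avg_price_mv;
```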

The figure above shows a simple dataset defined in Yanhuang; the query itself carries a time window, and clicking on a record displays each of its fields. On this dataset we create a materialized view with a 1-hour time bucket that computes the average, as shown below.

The SHOW statement displays the basic information of the materialized view along with its materialization progress, expressed as the ratio of materialized data shards to all data shards to be materialized.

In Yanhuang, precomputed results are stored in the Parquet format. Parquet is a columnar storage format that uses efficient compression algorithms and data encodings to reduce data size, lowering I/O and improving performance. Real-time materialized view updates are maintained automatically by the system and are driven by data change events: when new data arrives or certain conditions are met, the system triggers the corresponding precomputation and update operations, with the data shard as the unit of update. This event-driven approach lets the materialized view respond flexibly to data changes and query requests and keeps it up to date. Most of the examples above focus on aggregation, but a materialized view can also be defined with a non-aggregating query: when only certain fields, or only data matching specific filter conditions, are needed, a materialized view can likewise accelerate query analysis.

The performance of this implementation is closely related to factors such as the time distribution of the data, the query time range, the shard size, and the time bucket size. As a simple test using the product, we enabled debug logging in the Yanhuang system, which collects roughly 100 million records per day spread over about 1,000 data shards. On this data, querying the number of times each dataset is used takes 40 s.

We then created real-time materialized views on the dataset with bucket sizes of 1 hour and 1 day. As shown in the figure below, query performance improves significantly. The materialized view with a 1-day bucket outperforms the 1-hour one because the query time window is also 1 day, matching the bucket size. The raw data shards are stored at about 15 MB per shard, and the materialized shards under the different time buckets show the same storage advantage, thanks to Parquet's efficient storage.

04. Outlook and Summary

The sections above described how the real-time materialized view is implemented in the product. We also plan to explore materialized views further in the following directions:

  • Intelligent routing: when a query over the raw data uses aggregation or filtering logic that is covered by a materialized view, the system can automatically rewrite it into a query over the materialized view, routing it intelligently and accelerating it without user intervention (see the sketch after this list).

  • Hierarchical materialization: the materialized storage described in this article is partitioned mainly by time range, but the physical table holding the materialized results can be further partitioned by other specific conditions for additional acceleration.

  • Another kind of ETL: maintaining a materialized view means reading the raw data, performing precomputation, and storing the results, a process very similar to ETL. All data computations in the system, including indexing and querying, can be expressed in a similar form.
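
As a sketch of what intelligent routing could do (query shapes and names are illustrative), a query written against the raw data would be rewritten transparently into one against the materialized partial aggregates:

```sql
-- Query as written by the user, against the raw data:
SELECT AVG(price)
FROM orders
WHERE ts >= TIMESTAMP '2023-07-01 00:00:00';

-- The same query after automatic rewriting, against the materialized view:
SELECT SUM(price_sum) / SUM(price_count)
FROM mv_partials_bucketed
WHERE bucket_start >= TIMESTAMP '2023-07-01 00:00:00';
```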

This article has focused on the implementation of the real-time materialized view. By precomputing and storing materialized data, queries can read results directly from the materialized view instead of repeating complex computations over the raw time series data, which greatly improves query performance. In our implementation, the data is bucketed by time and maintenance is triggered automatically by data events, keeping materialized view queries consistent with queries over the raw data in real time. Real-time materialized views do, however, bring extra storage and maintenance cost; to use them well, the typical query time window and the time distribution of the data should be considered together with the actual scenario when customizing analysis acceleration.

There are still many aspects of materialized views worth exploring, and I look forward to the opportunity to discuss and share them with you in the future.

05. Q&A

Q1: Can materialized views replace existing report models?

A: Materialized views are used to accelerate reports. Reports are often viewed frequently, and the queries behind them typically go through materialized views to speed up report display.

Q2: Which framework is the materialized view implementation described here based on?

A: This is Yanhuang's own materialized view implementation; it does not depend on any other framework.

Q3: Are there any requirements for the aggregation functions involved here?

A: This talk described the basic implementation ideas of the real-time materialized view. The precomputation process depends on the system's underlying compute engine: as long as the engine can decompose an aggregate computation into a pre-aggregation step and a post-aggregation step, the aggregate can be used in the real-time materialized view. In general, materialized views do not support non-deterministic aggregates, non-linear aggregates, aggregates that depend on dynamic parameters, and so on, as illustrated below.
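
For example, SUM, COUNT, MAX, or AVG (as SUM/COUNT) can be split into per-shard partial results and merged later, whereas an exact median cannot be reconstructed from per-shard medians alone (shard names below are illustrative):

```sql
-- Decomposable: the global MAX is simply the MAX of the per-shard maxima.
SELECT MAX(m) AS max_price
FROM (
    SELECT MAX(price) AS m FROM raw_orders_shard_1
    UNION ALL
    SELECT MAX(price) AS m FROM raw_orders_shard_2
) per_shard;

-- Not decomposable this way: the exact median of the combined data cannot, in
-- general, be computed from the per-shard medians, so such aggregates are
-- typically not supported by this kind of incremental materialized view.
```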

Editor: Yu Tengkai

Proofreading: Lin Yilin
