Practical algorithm application cases: Bilibili's practice of massive user behavior analysis based on ClickHouse

1. Background introduction

The data-driven concept is now well known across industries; its core links include data collection, event-tracking (buried point) planning, data modeling, data analysis, and metric system construction. In the field of user behavior data, extracting information from common multi-dimensional data models and integrating those models yields a set of general analysis methods that reveal the inner relationships of user behavior, give better insight into users' habits and behavior patterns, and help enterprises mine the commercial value of user data.


The earliest work in the industry can be traced back to Google Analytics, the event-tracking analysis tool; in China, Baidu's big data analysis platform was the first to explore this area. With the rise of domestic big data around 2015, independent data analysis platform companies such as Sensors Data's user behavior analysis platform and GrowingIO's growth platform were founded one after another. After 2018, some fast-growing vendors also built their own analysis platforms on top of several years of data accumulation, such as Meituan-Dianping's Ocean behavior analysis platform and ByteDance's Volcano Engine growth analysis platform.

Only when data reaches a certain scale does it make sense to use scientific methods to improve the efficiency of data analysis. As mentioned above, although Google and Baidu explored this area first, some Internet companies only built their own products years later; that is, the development of data products needs to keep pace with the actual data scale and business development. Bilibili began to invest in big data construction in 2019 and now has a relatively mature data product, Polaris, which covers event-tracking collection, tracking testing, tracking management, and behavioral data analysis. The behavioral data analysis platform mainly includes the functional modules listed in the figure below. This article introduces the principles of the main modules and how the related technologies are implemented.

2. Evolution of technical solutions

The Polaris User Behavior Analysis (UBA) module has mainly had three iterations since 2019.

2019~Mid 2020: [partial modeling aggregation + Spark Jar task]

The main task at this stage was to get the functionality working: based on the user's front-end query parameters, submit a Spark Jar job and wait for the results to be returned. Different analysis modules correspond to different Spark Jar jobs and to different preprocessed user behavior models. The data architecture is shown in the figure below:

Although this delivered the functionality to a certain extent, the disadvantages were obvious:

  • Partial modeling: user dimension information had to be processed into the model tables in advance, which made later changes and maintenance difficult, and the early analysis model design did not support querying private parameters, i.e., only part of the detailed data was retained;
  • Resource adaptation problems: each Spark Jar task had to apply for resources separately through YARN when it started. The computational complexity varied with the query conditions, but the task resource parameters were fixed; on the one hand, applying for and allocating resources took a long time, and on the other hand, resources could not adapt dynamically to task complexity. Even maintaining a memory-resident SparkSession for query tasks could not solve the problem of matching resources to each query;
  • Limited concurrency: when too many query requests arrived in the same time window, later requests had to wait for the Spark tasks of earlier requests to release their resources, and since resources were not isolated, other normal ad-hoc queries were affected as well.

In actual use, computation simply took too long: a single event analysis took more than 3 minutes to return results, and funnel and path analysis took more than 30 minutes, so product availability was extremely low, query stability and success rate were poor, and there were few users. Event-tracking management and reporting formats were not yet fully standardized at this stage, so the focus at the time was still on the latter.

Mid-2020~Mid-2021: [model-free detail data + Flink + ClickHouse]

ClickHouse is a columnar database management system open-sourced by Yandex in 2016. Yandex's core product is a search engine, whose business depends heavily on traffic and online advertising, so ClickHouse is naturally well suited to user traffic analysis. Bilibili began to introduce ClickHouse in 2020 and rebuilt the Polaris behavior analysis scenario around it, as shown in the following figure:

Here, consumption starts directly from the raw data: a Flink cleaning task washes the data straight into ClickHouse to produce user behavior detail rows, which can be called model-free detail data. A Redis dimension table handles real-time user attribute association, a dictionary service converts String-type entity IDs into Bigint, and ClickHouse's native RoaringBitmap functions compute intersections and differences of the behavioral groups involved in a calculation. This generation made real-time inspection of newly reported events possible, and since this version of Polaris launched, weekly active users have grown by more than 300%. Compared with the previous generation, performance improved greatly:

  • Query speed greatly improved: 90% of event analysis queries return results within 5 seconds, 90% of funnel queries return within 30 seconds, an improvement of more than 98%;
  • Real-time queries: the current day's real-time user behavior data can be analyzed, which greatly improves the timeliness of analysis results.

But this performance improvement came at the cost of resources. Taking mobile logs as an example, the Flink consumption task peaks at millions of records per second, which puts great pressure on the Redis dimension-table association and the dictionary service, and the computation can require as many as 1,200 cores concurrently; once the stream is interrupted, manual operation and maintenance costs are also considerable. In addition, under this Lambda architecture the real-time and offline cleaning logic must be kept consistent, otherwise the cost of explaining data discrepancies rises quickly; maintaining both real-time and offline pipelines also wastes storage, since Kafka, Hive, and ClickHouse all have to hold the same data. By the end of 2021, with business growth, ClickHouse storage had gone through several horizontal expansions and less than 10% of capacity remained, while cluster expansion and data migration also took a lot of effort, which is described in detail later in this article. Functionally, running native ClickHouse functions directly on the detail data takes minutes for cross-day retention analysis and path analysis, which is not a good experience.

From mid-2021 to today: [Iceberg full model aggregation + ClickHouse]

Since 2022 the company has pushed hard on cost reduction and efficiency improvement, which means maximizing the performance of the behavior analysis product with as few resources as possible. The core idea is full-model aggregation for acceleration: the underlying traffic data link uses a Kappa architecture, so Polaris application data can no longer be inconsistent with the traffic tables, and data is produced on an hourly basis. This transformation saves 1,400 cores of real-time compute, 400 GB of Redis memory, and 300 Kafka partitions, and reduces the daily data volume from 100 billion rows to 10 billion. Through a specific sharding scheme, push-down parameters, and the use of partitions, primary keys, and indexes, it supports event analysis (average query time 2.77 s), event merge and deduplication analysis (average 1.65 s), single-user detail search (average 16.2 s), and funnel analysis (average 0.58 s), and brings retention analysis and path analysis from minute-level queries down to responses within 10 s. The data architecture is shown in the figure:

It has the following characteristics:

  • Full-model aggregation: since mid-2021 we have designed a general-purpose traffic aggregation model, which can be regarded as a full-information Hive traffic model structure. Apart from degrading the time dimension, essentially all other information is preserved, and the original 100-billion-row scale can be compressed to tens of billions;
  • BulkLoad export: data is imported from HDFS into ClickHouse in batches, and hundreds of billions of rows can be imported within an hour; the principle is introduced later;
  • Dictionary service upgrade: we greatly improved dictionary service performance with an enhanced Snowflake algorithm + Redis + the company's self-developed RocksDB-based KV storage; stress tests show it can support 400,000 QPS;
  • Real-time computation of user attributes: instead of a pre-computation mode, we use a separate ClickHouse-based tagging platform to associate designated user tag groups across clusters and compute them on the fly, so the user attributes to analyze can be specified flexibly.

By mid-2022, with the rise of data lakes, we migrated the Hive traffic aggregation model to Iceberg, where daily event queries complete within 10 seconds; it can serve as a backup link for the ClickHouse data. This link not only reduces emergency operation and maintenance costs and improves data availability, but also lets users join daily traffic with other business data for customized queries. Besides traffic behavior logs, the general model structure can quickly onboard other server-side logs through mapping management, expanding its usage scenarios. The following figure shows the usage of each functional module in the last week of December 2022:

Looking at this history, user behavior data analysis has moved from being driven primarily by offline engines to being driven primarily by OLAP, which is inseparable from the continuous progress of big data technology across the industry. The underlying Polaris behavior detail data will later be switched to Hudi to satisfy more real-time data consumption needs, letting professional tools do professional things.

3. Event and retention analysis

Event analysis refers to metric statistics, attribute grouping, calculation, and condition filtering on specific behavioral events; in essence it is the analysis and statistics of how users trigger tracked events. Retention analysis lets you customize an initial behavior and a follow-up behavior for retention calculation according to different business scenarios and product stages. It helps analyze user stickiness, adjust strategies in a targeted way based on the retention results, guide users to discover product value, retain users, and achieve real user growth.

In the past, most analysis modules of the Polaris platform were built on Bilibili's hundred-billion-row behavior detail data. Using ClickHouse metric functions such as uniq(), they could support single-event analysis, comparative analysis of multiple events, compound metric calculations across events, and retention analysis within a specified time window (the ratio of users who performed the follow-up behavior to users who performed the initial behavior), with filtering and grouping components meeting diverse analysis needs. However, because event analysis ran on detail data, and Bilibili's behavior data grows by hundreds of billions of rows and more than 10 TB of storage per day, resource consumption was huge, detail queries were slow (daily slow queries averaged 30–50 s), and the user experience was poor. The functionality was also relatively thin: only a 30-day query window was supported, and complex modules such as user retention and user segmentation were hard to implement.

Analyzing behavior data at this scale also raises many challenges. With hundreds of billions of rows written per day and write peaks above one million QPS, how do we design a calculation method that satisfies both timeliness and the pressure of massive data? How can compressed storage improve query efficiency while still supporting complex analysis scenarios? How can the data link be simplified, access costs reduced, and scalability improved through modularization and plug-ins? How can Polaris's behavior analysis capabilities be shared with business systems such as tags and ABTest in a standardized way?

Polaris event analysis:

To solve the above pain points and the challenges of massive data analysis, the new event and retention analysis is modeled and layered in a quasi-real-time way, pre-aggregating and compressing data at the user, event, and time granularity. This unifies the offline caliber, lets the converged Spark scripts bear the pressure of hundreds of billions of rows, and enables rich analysis modules on top of a variety of aggregation models. Real-time resources are released in favor of offline hourly tasks while timeliness is preserved; dimension-table pressure is handled by joining offline dimension tables plus the attribute dictionary dimension service; and, ahead of the platform's own tooling, we built a BulkLoad export tool that can write to specified shards and push down parameters to speed up queries, so the data link is scalable and easy to operate. Compared with processing hundreds of billions of detail rows, the DWB layer compresses data quasi-real-time from hundreds of billions per day to tens of billions, and the OLAP layer replaces the original detail data with the aggregates, greatly reducing storage and improving query performance: daily slow queries drop to under 10 s, the time window can be extended to 45 days or longer, and highly complex queries such as user retention and user segmentation are much better supported.

Event analysis data development process:

The specific implementation includes the following core parts:

1. Create traffic aggregation model.

First, Bilibili's hundred-billion-row behavior detail data is cleaned quasi-real-time at the DWD layer. Traffic data is divided into private parameters and public parameters. Public parameters rarely change at the user granularity, so general aggregation functions take the latest unchanged public parameters for a given device and behavior event within a time window, while the dimension names of the frequently changing private parameters at the same granularity are written into an Array structure. Borrowing the idea of a map index, Spark custom logic concatenates the private-parameter dimension values into the map key, and the map value stores the aggregated results of various public metrics. The whole process is implemented with Spark scripts and finally written into Iceberg. Because Iceberg can be joined with any existing Hive table, it can support multiple other business applications through fast joins with business tables, and it can also serve as a Polaris downgrade/backup path inside the warehouse that supports most query and analysis functions.
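As a rough illustration of this aggregation idea, the following is a minimal Spark SQL sketch, not the production job: the column names (buvid, event_id, page_id, source, mid, app_version, log_time) and the source table are hypothetical, and the real model carries many more fields (including the Array of private-parameter dimension names).

```sql
-- Inner query: count each private-parameter combination per device/event.
-- Outer query: fold those combinations into a Map and keep one row per (buvid, event_id).
INSERT INTO iceberg_bdp.dwb_flow_ubt_app_group_buvid_eventid_v1_l_hr
SELECT
    buvid,
    event_id,
    max(mid)          AS mid,                 -- public params: rarely change per device
    max(app_version)  AS app_version,
    map_from_entries(
        collect_list(struct(private_key, pv))
    )                 AS metric_map           -- key = concatenated private dims, value = aggregated metric
FROM (
    SELECT buvid,
           event_id,
           concat_ws('|', page_id, source) AS private_key,   -- frequently changing private params
           count(*)                        AS pv,
           max(mid)                        AS mid,
           max(app_version)                AS app_version
    FROM dwd_flow_ubt_app_detail_hr
    WHERE dt = '${dt}' AND hr = '${hr}'
    GROUP BY buvid, event_id, concat_ws('|', page_id, source)
) t
GROUP BY buvid, event_id;
```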

Traffic aggregation model data scheme:

2. Querying the traffic aggregation model on Iceberg.

As shown in the figure below, the aggregated data forms the DWB layer and lands in the Iceberg table (iceberg_bdp.dwb_flow_ubt_app_group_buvid_eventid_v1_l_hr in the figure), on which metrics under most query dimensions can be computed with Hive and Spark. Using Trino's connector-based separation of storage and compute, a series of complex event analyses can be implemented with Trino condition functions such as map_filter and array_position and collection functions such as map_values and reduce. We have also developed some simple, easy-to-use UDFs that wrap the more complex Trino function combinations for user queries, with little difference in performance.
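A hedged Trino sketch of an event-analysis query on this aggregate might look like the following; the table name comes from the text, while the column names follow the hypothetical sketch above and the event/dimension values are made up.

```sql
-- UV plus PV for one event, filtered on a private-parameter dimension stored in the Map.
SELECT
    count(DISTINCT buvid) AS uv,
    sum(
        reduce(
            map_values(map_filter(metric_map, (k, v) -> k LIKE 'video-detail|%')),
            CAST(0 AS BIGINT), (s, x) -> s + x, s -> s
        )
    ) AS pv
FROM iceberg_bdp.dwb_flow_ubt_app_group_buvid_eventid_v1_l_hr
WHERE dt = '2022-12-01'
  AND event_id = 'main.recommend.click';
```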

3. Create public and private parameter filters.

Next, we use the BulkLoad export script to import the Iceberg data into the ClickHouse table (polagrou.polaris_dwb_flow_ubt_group_buvid_eventid_pro_i_d_v1 in the figure), which ensures timeliness and is compatible with the special data structure. The ClickHouse table design supports sampling via SAMPLE BY murmurHash3_64(buvid); since writes are sharded by buvid (device ID), data is randomly distributed within a single node, so sampling on a single node is sufficient, and the ReplicatedReplacingMergeTree engine is used. A ClickHouse-to-ClickHouse materialized filter is added on top to provide the Polaris platform with dimension filtering for aggregated public-parameter dimensions and sorted private-parameter enumerations. The whole process runs as a schedulable Python script and supports hourly updates.
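A hedged sketch of what such a local-table DDL could look like is shown below; only the engine, sampling key, and table name are taken from the text, and the column list and ZooKeeper path are illustrative.

```sql
CREATE TABLE polagrou.polaris_dwb_flow_ubt_group_buvid_eventid_pro_i_d_v1_local
(
    log_date     Date,
    event_id     String,
    buvid        String,
    mid          UInt64,
    app_version  String,
    metric_map   Map(String, UInt64)          -- private-dim key -> aggregated metric
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/polaris_dwb_flow_ubt', '{replica}')
PARTITION BY log_date
ORDER BY (event_id, murmurHash3_64(buvid))    -- sample key must be part of the sorting key
SAMPLE BY murmurHash3_64(buvid);
```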

4. Querying the traffic aggregation model on ClickHouse.

For ClickHouse queries we designed a specific CK UDF to parse the nested map structure, which covers complex analysis scenarios and accelerates queries: it is about 30% faster than combining multiple native ClickHouse functions for parsing, and far faster than querying the original detail model. We also built multi-dimensional hourly ClickHouse monitoring and alerting robots via scripts, ahead of the platform's support for this kind of customized alerting.

At present, the average query time on the Polaris analysis platform is 3.4 s. On top of the general aggregation model, downstream systems can cross and merge behavioral groups to implement conversion analysis features such as tag portraits and audience selection, and the retention function can be used for N-day event retention analysis. Compared with the previous generation solution, this saves 1,400 cores of compute and 40% of storage, improves query efficiency by more than 60%, and uses RBM to support multi-service access from Polaris, tags, ABTest, and more.
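As a hedged sketch of the N-day retention idea on the ClickHouse aggregate (column names are hypothetical, following the DDL sketch above): retention() emits a 0/1 array per user, and summing it positionally yields day-0/day-1/day-7 retained UV.

```sql
SELECT
    sum(r[1]) AS day0_uv,
    sum(r[2]) AS day1_retained,
    sum(r[3]) AS day7_retained
FROM
(
    SELECT buvid,
           retention(
               log_date = '2022-12-01' AND event_id = 'app.active',   -- initial behavior
               log_date = '2022-12-02' AND event_id = 'app.active',   -- day-1 follow-up
               log_date = '2022-12-08' AND event_id = 'app.active'    -- day-7 follow-up
           ) AS r
    FROM polagrou.polaris_dwb_flow_ubt_group_buvid_eventid_pro_i_d_v1
    WHERE log_date BETWEEN '2022-12-01' AND '2022-12-08'
    GROUP BY buvid
);
```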

4. Funnel and path analysis

In traffic analysis scenarios, analysts examine the path flow of a group of users through the client or web pages. Path analysis presents users' journeys through the product as a Sankey diagram, showing how traffic flows between pages and page groups; it helps verify product operation strategies and optimize product design. A funnel is a series of behavioral conversions that users complete while using a product; funnel analysis helps understand conversion or loss at each behavioral step, so that conversion rates can be improved by optimizing the product or running operational activities to achieve business goals.

As the business grows, the demand for detailed funnel and path analysis keeps increasing, so the Polaris analysis platform added support for analyzing how the traffic of a group of users changes before and after a given page or module. Funnel analysis is the common industry solution for such scenarios: ClickHouse provides the windowFunnel function to implement funnel analysis on detail data. Path analysis generally comes in two flavors: simple path analysis over detail data combined with sequenceCount(pattern)(timestamp, cond1, cond2, ...), and complex path analysis, also called intelligent path analysis, for which ClickHouse's higher-order array functions offer a workaround.
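For the simple path case, a hedged sketch of sequenceCount on detail data might look like the following; the table, columns, and event IDs are hypothetical, and log_time is assumed to be a DateTime column.

```sql
-- Count, per user, how many times event A is followed by event B, then sum across users.
SELECT sum(cnt) AS a_to_b_transitions
FROM
(
    SELECT buvid,
           sequenceCount('(?1)(?2)')(
               log_time,
               event_id = 'player.start',    -- (?1)
               event_id = 'player.finish'    -- (?2)
           ) AS cnt
    FROM polaris.ubt_detail_local
    WHERE log_date = '2022-12-01'
    GROUP BY buvid
);
```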

Path Analysis Background Challenges:

However, the old funnel and path analysis were both based on detail data, which consumed a lot of storage, made analysis queries slow, and offered relatively weak functionality. To solve these pain points, the new funnel and path analysis compresses hundreds of billions of rows per day to billions per day through offline modeling and layering, pre-aggregation at the user-path granularity, and RBM materialized views in ClickHouse. Query efficiency improves from minute level to second level, and various conversion analyses are supported by joining tags and audiences. While storage is greatly reduced and query performance greatly improved, features such as associated tags and audience selection are finally realized.

Path analysis function page:

The specific implementation includes the following core parts:

1. Path aggregation DWB model creation.

First, Bilibili's hundred-billion-row behavior detail data is processed offline: the frequently changing private parameters are trimmed by dimension, the public parameters at the user granularity are retained, and all events of the same buvid are concatenated in time order into a single field. The aggregated data forms the DWB layer and lands in a Hive table.
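A hedged Spark SQL sketch of this DWB step is shown below; the table and column names are hypothetical, and the production job applies more cleaning logic.

```sql
-- All events of one buvid in a day, sorted by time and concatenated into one path field.
INSERT INTO dwb_flow_path_buvid_d
SELECT
    buvid,
    concat_ws('->',
        transform(
            array_sort(collect_list(struct(log_time, event_id))),  -- sort events by time
            x -> x.event_id
        )
    )        AS event_path,   -- e.g. 'e0->e4->e1->e3->e2'
    max(mid) AS mid            -- user-level public params kept alongside
FROM dwd_flow_ubt_app_detail_d
WHERE dt = '${dt}'
GROUP BY buvid;
```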

Path analysis data scheme:

2. Path aggregation DWS model creation.

On top of the previous step, the DWB-layer data is further summarized by path, and the buvids (device IDs) of the same path are aggregated into an Array structure. There are many interfering events in this process; for example, some paths appear very frequently or out of order and distort real user behavior, so we deduplicate, filter, and splice interfering events to form the Sankey diagram nodes. We also introduce the RBM data structure to store the aggregated device codes, and the result finally lands in a Hive table. The whole process is implemented with Spark scripts and custom algorithms.

Funnel analysis query scheme:

3. Path aggregation model ClickHouse table design.

Next, we use platform tools to export the Hive data to ClickHouse. In the ClickHouse table design we adopt materialized views and the RBM data structure, further compressing the buvid (device ID) sets into RBM encodings. Materializing the arrays as RBM greatly compresses storage, path-related metrics can be computed via bitmap intersections, and hundreds of billions of rows are compressed to billions, achieving second-level queries.
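The following is a hedged ClickHouse sketch of this RBM idea, with hypothetical table and column names: dictionary-encoded device IDs (buvid_code, UInt64) are folded into one RoaringBitmap per path node, so node UV becomes a bitmap cardinality instead of a count(DISTINCT ...), and bitmapAnd/bitmapAndCardinality on two such bitmaps gives intersections for funnels and conversions.

```sql
CREATE MATERIALIZED VIEW path_node_bitmap_mv
ENGINE = AggregatingMergeTree()
ORDER BY (log_date, path_node)
AS
SELECT
    log_date,
    path_node,                                -- e.g. 'e0->e4'
    groupBitmapState(buvid_code) AS buvid_bitmap
FROM path_dws_local
GROUP BY log_date, path_node;

-- Query side: merge the bitmap states of one node and read its cardinality as the UV.
SELECT groupBitmapMerge(buvid_bitmap) AS node_uv
FROM path_node_bitmap_mv
WHERE log_date = '2022-12-01' AND path_node = 'e0->e4';
```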

Path analysis data protocol:

The tree diagram formed by the data structure:

4. Path aggregation model funnel analysis query.

Functionally, funnel analysis is computed with the windowFunnel function: each user's behavior details within the calculation period are aggregated into an event chain in chronological order, the chains that satisfy the funnel conditions within the sliding time window are searched, the largest number of consecutive funnel events occurring in a chain is taken as the user's level, and finally the UV is counted per level to obtain the result.

The number on a right-hand node represents the UV of the path from the central event e0 to that node:

The corresponding relationship in the tree diagram: it means that the total UV of the path e0->e4->e1->e3->e2 within the window period is 1. The same applies to the left side, in the opposite direction.

5. Path aggregation model path analysis query.

Similarly, path analysis uses the data protocol and complex SQL to build the path tree on top of the ClickHouse data and then assemble the Sankey diagram, which intuitively shows the main flows of users, helps identify the key steps in the conversion funnel, quickly surfaces product value points that users overlook so that their exposure can be corrected, locates where users drop off, and, through bitmap intersections, enables conversion analysis features such as tag portraits and audience selection.

5. Tags and group selection

Bilibili’s Polaris behavior analysis platform, tag portrait platform, and AB experiment audience packages are all implemented on top of ClickHouse’s RBM (RoaringBitmap). RBM has many other applications as well, such as tag-based audience selection in event analysis, pre-computed path analysis, and building user groups from user behavior; for details, see the earlier article [1].

The following figure shows the logic, based on Polaris's underlying ClickHouse data, for generating an audience package that meets specified behavioral criteria:

RBM is easy to use, but it only supports int or long types. What if the deduplication field is not an int or long? How do we achieve high availability and high concurrency in a dimension service for the massive-data application layer? If something goes wrong, how do we quickly restore the dependent links and protect the data?

The attribute dictionary dimension service is a distributed, highly available, high-concurrency service that encodes and decodes attributes for multiple businesses and outputs and manages multi-business dimensions. Through it, multi-dimensional management and multi-business connection can be achieved, providing technical support for customization at the massive-data application layer.

Attribute dictionary dimension service architecture design:

For high availability, a multi-level cache distributed architecture of gRPC + LoadCache + Redis + the company's self-developed RocksDB-based KV storage supports smooth scaling and rolling releases, and achieves a daily cache hit rate above 70%. The underlying ID generation algorithm is based on Leaf-SnowFlake, and tests show it can support more than 500,000 QPS. All requests are synchronized hourly to Hive for backup through the company's log transmission channel; in case of an accident, with BulkLoad read/write separation, a dictionary of 2 billion+ attributes can be restored within 40 minutes.

Finally, the attribute dictionary is used to encode and decode business attributes such as buvid (device ID) and to build user tags and AB groups, and RBM intersection operations connect the Polaris analysis platform, the user portrait platform, and the AB experiment platform.

Crowd circle selection sql example:
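A hedged sketch of what such an audience-selection query can look like follows (hypothetical table and column names; the figure's exact SQL may differ): users who did event A but not event B on the day, intersected with an existing tag audience, all expressed as RoaringBitmaps over dictionary-encoded buvid codes.

```sql
SELECT bitmapAndCardinality(
           bitmapAndnot(
               (SELECT groupBitmapState(buvid_code)
                  FROM polaris_event_agg_local
                 WHERE log_date = '2022-12-01' AND event_id = 'A'),
               (SELECT groupBitmapState(buvid_code)
                  FROM polaris_event_agg_local
                 WHERE log_date = '2022-12-01' AND event_id = 'B')
           ),
           (SELECT tag_bitmap FROM tag_platform.crowd_package WHERE crowd_id = 123)
       ) AS audience_size;
```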

6. Evolution of ClickHouse data import scheme

As mentioned above, Polaris is a massive-scale UBA solution built on ClickHouse. The stability, read/write performance, and resource usage of the underlying ClickHouse cluster all affect the experience of the services above it. At the same time, how massive data is imported into ClickHouse, and the stability, efficiency, and resource consumption of that import process, largely determine the overall stability and efficiency of the ClickHouse cluster. A stable and efficient data import solution is therefore essential for a UBA solution.

At Bilibili, the data import solution for the UBA scenario has gone through roughly three stages of evolution:

1. JDBC writing scheme

Inside Bilibili there are two pipelines for writing data into the various databases/engines: an offline import link based on Spark, whose data mostly comes from Hive, and a real-time import link based on Flink, whose data mostly comes from Kafka. Both links support ClickHouse as a data sink. The UBA scenario initially used these two links for data import, mainly the real-time one; the offline link was used in a few cases such as the initial import of historical data and fault compensation.

As shown in the figure above, both offline and real-time imports use ClickHouse JDBC to send data to ClickHouse. This writing method is relatively simple to implement: the open-source ClickHouse JDBC driver exposes the standard JDBC interface for writing data. The latency of Flink real-time writes is also low, with end-to-end latency at the second level. But this solution has the following problems:

Resource consumption on the ClickHouse server side is relatively high (data sorting, index generation, data compression, and other steps are all done on the server), which affects query performance at peak times.

Real-time tasks write frequently, and the written data triggers a large number of merge operations, resulting in write amplification that consumes more disk I/O and CPU and may cause "too many parts" errors.

The real-time Flink tasks occupy a lot of resources for long periods, and when failures occur they are prone to data backlog, latency, and stream interruption, with high operation and maintenance costs.

These problems do not affect the business when resources are abundant, but once cluster resources approach the bottleneck, query performance is affected by writes, and write performance and stability are affected by merges, which eventually degrades the overall stability of the cluster and affects business usage.

2. BulkLoad import scheme based on intermediate storage

The analysis modules in the UBA scenario have different data latency requirements; most data does not need to be real-time, and hour-level latency is acceptable for most modules. Therefore, to solve the problems of the JDBC writing scheme, we built a BulkLoad import scheme based on intermediate storage for the import needs that are not latency-sensitive:

First, the generation of data part files in ClickHouse format is moved into a Spark application, so that the resources of the YARN cluster can be used to complete data sorting, index generation, data compression, and related steps.

The data part files are generated with the clickhouse-local tool: clickhouse-local is invoked inside the Spark executor to write data to local disk and produce ClickHouse data part files.

Then, upload the data part file generated by Spark Executor to a specific directory of the HDFS file system.

Next, send the "ALTER TABLE ... FETCH PART/PARTITION" SQL statement from the Spark Executor to the clickhouse server for execution.

Finally, the ClickHouse server executes "ALTER TABLE ... FETCH PART/PARTITION", pulls the data part files from HDFS, and completes the attach operation. We made some changes to the ClickHouse code so that the FETCH statement supports pulling files from HDFS.
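For illustration, the stock FETCH statement pulls a partition from another replica's ZooKeeper path; per the text, Bilibili's patched build lets the same statement point at an HDFS directory instead. The path and table names below are illustrative only, and the fetched part files land in the table's detached directory before being attached.

```sql
ALTER TABLE polagrou.polaris_dwb_flow_ubt_local
    FETCH PARTITION '2022-12-01'
    FROM 'hdfs://bulkload-ns/clickhouse/polaris_dwb_flow_ubt/2022-12-01';

ALTER TABLE polagrou.polaris_dwb_flow_ubt_local ATTACH PARTITION '2022-12-01';
```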

Since BulkLoad import moves the work of writing data into part files to the Spark side, the resource consumption of data writing on the ClickHouse server drops sharply. Also, because repartitioning and batching are completed on the Spark side before the batch write, the number of data parts arriving at the ClickHouse server is far smaller than with JDBC writes, so ClickHouse's merge pressure also drops sharply. After this solution went live, the impact of data writing on ClickHouse queries was basically eliminated, and cluster stability improved greatly.

But there are still some problems with this solution:

Using HDFS as the intermediate storage for file transfer increases transfer time and network overhead, and also occupies HDFS storage resources.

The load of HDFS may affect the performance and stability of ClickHouse Bulkload data import.

3. BulkLoad import scheme directly to ClickHouse

To further optimize import performance and stability, we developed a DataReceive service for ClickHouse, modeled on the DataExchange service used for data synchronization between ClickHouse replicas, so that Spark executors can transfer data part files directly to the ClickHouse server, bypassing HDFS intermediate storage.

The DataReceive service allows an HTTP client to send data files directly to ClickHouse; the ClickHouse side performs authentication, data verification, flow control, concurrency control, disk load balancing, and other operations. Compared with the BulkLoad solution based on HDFS intermediate storage, this solution roughly doubles performance.

7. ClickHouse data rebalancing

Bilibili’s user behavior data amounts to hundreds of billions of rows per day, and the UBA scenario needs to analyze more than half a year of history, so the underlying ClickHouse needs to store petabytes of compressed data. With Bilibili's active users still growing, the amount of data to store keeps increasing, so cluster expansion is unavoidable.

However, due to the limitations of its storage-compute-coupled architecture, ClickHouse currently cannot scale elastically, and data needs to be redistributed across the new cluster. How to complete ClickHouse data rebalancing efficiently and stably is therefore a problem that every ClickHouse cluster administrator must face and solve.

During the preparation and implementation of cluster expansion for the UBA scenario, we evolved from manual to semi-automated to service-oriented approaches, and turned the problems and solutions encountered while rebalancing massive data into a set of automated tool services. The following introduces the functions and implementation principles of these tools.

1. Balance

Table sizes in the cluster vary greatly: some reach hundreds of TB while others are only a few GB. How do we measure how balanced the data is and filter out the tables that need rebalancing? We introduce some simple statistics to solve this problem.

Coefficient of variation: when comparing the dispersion of two data sets whose measurement scales or units differ greatly, it is not appropriate to compare standard deviations directly; the effect of scale and units must be removed first. The coefficient of variation does exactly this: it is the ratio of the standard deviation to the mean of the data. For our per-node table sizes it typically falls between 0 and 1, and the smaller the value, the lower the dispersion.

Table balance = coefficient of variation (typically between 0 and 1; the larger the value, the more unbalanced the table)

Example: Balance of Table A

There are 4 nodes in the cluster, and the sizes of table A on different nodes are 4GB, 10GB, 5GB, and 3GB respectively

Average: (4 + 10 + 5 + 3) / 4 = 5.5

Variance: Σ(xᵢ − mean)² / 4 = ((4 − 5.5)² + (10 − 5.5)² + (5 − 5.5)² + (3 − 5.5)²) / 4 = 7.25

Standard deviation: √7.25 ≈ 2.69

Coefficient of variation: standard deviation / mean = 2.69 / 5.5 ≈ 0.49

Table A's balance = 0.49
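As a hedged sketch, this balance measure can be computed directly from system.parts across the cluster; the cluster and database names below are illustrative, and stddevPop/avg is exactly the coefficient of variation described above.

```sql
SELECT
    table,
    stddevPop(bytes) / avg(bytes) AS balance_cv
FROM
(
    SELECT hostName() AS host, table, sum(bytes_on_disk) AS bytes
    FROM clusterAllReplicas('ck_cluster', system.parts)
    WHERE active AND database = 'polagrou'
    GROUP BY host, table
)
GROUP BY table
ORDER BY balance_cv DESC;
```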

2. Balance algorithm

For the tables to be balanced, some businesses want the greatest possible balance to improve parallelism and make full use of the cluster's compute, while for some very large tables the business wants to reach a reasonably good balance quickly with the minimum migration cost.

To meet these different requirements, two balancing algorithms are provided: a bin packing algorithm and a greedy algorithm.

When the goal is the most thorough balance and the amount of data is small, the bin packing algorithm is recommended; when the goal is a better balance at the minimum migration cost, the greedy algorithm is recommended.

(1) Bin packing algorithm

The algorithm as a whole uses a Best Fit (best-fit packing) + AVL tree design. Each ClickHouse node is a Node, and each Node has an initial threshold capacity representing the node's capacity. The parts to be balanced are sorted by size and placed into Nodes according to the Best Fit algorithm; the Nodes form an AVL tree keyed by remaining_capacity, which improves lookup efficiency and speeds up the balancing.

The design is shown in the figure below.

The details of the bin packing algorithm are not repeated here; interested readers can refer to [2].

(2) Greedy algorithm

The algorithm as a whole uses a design of continuous polling + local optimum. The ClickHouse nodes are sorted by size, and the largest and smallest nodes are found; if, after moving a part from the largest node to the smallest node, the source node is still larger than the destination node, the part is moved, and this repeats until no more parts can be moved off the largest node. The process then continues: the nodes are re-sorted by size, the largest and smallest nodes are found each time, and parts are balanced to a local optimum until the polling over the ClickHouse nodes ends.

The design is shown in the figure below:

3. Balance plan

From the balancing algorithm we obtain the planned moves into and out of each node in the cluster. The unit of balancing is the table, and the migration granularity is the part; it can be understood as balancing the parts within a table.

As shown in the figure below, you can see the balance degree before and after balancing a table, as well as the planned moves into and out of node 1. After the balance plan is generated, you can choose to execute a specific plan as needed.

4. Rebalance execution process

How do we move parts in and out accurately and efficiently while executing the balance plan? How do we guarantee atomicity and avoid data loss or duplication? How do we throttle the process so that balancing does not take up too many resources and affect cluster stability?

After continuous testing and adjustment, a relatively robust balancing procedure was finally settled on. The overall flow is: pre-check (whether a merge is in progress) + fetch (destination node) + detach (destination node) + attach (destination node) + detach (source node) + drop detached (source node).
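Expressed with stock ClickHouse DDL, one part move roughly corresponds to the statements below; this is a hedged approximation with illustrative table, part, and ZooKeeper names, and the in-house service wraps these steps with pre-checks, retries, rollback, and throttling. Depending on the server version, PARTITION-level variants of the same statements may be needed.

```sql
-- On the destination node: fetch the part into the detached directory, then attach it.
ALTER TABLE db.t_local FETCH PART 'all_100_100_0' FROM '/clickhouse/tables/01/t_local';
ALTER TABLE db.t_local ATTACH PART 'all_100_100_0';

-- On the source node: detach the part, then drop it from the detached directory.
ALTER TABLE db.t_local DETACH PART 'all_100_100_0';
ALTER TABLE db.t_local DROP DETACHED PART 'all_100_100_0' SETTINGS allow_drop_detached = 1;
```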

For exceptions at different stages during balancing, corresponding retry and rollback mechanisms were added to cover problems such as network jitter and ZooKeeper reconnection, ensuring the atomicity of the balancing and data consistency.

During balancing, the throttling configuration (max_replicated_fetches_network_bandwidth) is used to control the balancing speed, which ensures cluster stability and avoids affecting other businesses' normal queries.

The overall design is shown in the figure below.

8. ClickHouse application optimization practice

In the process of supporting the various functional modules of the UBA scenario, we have done a lot of application-level optimization on ClickHouse queries and storage. A few optimization points are briefly introduced below.

1. Query push down

A query against a distributed table in ClickHouse is rewritten as queries against the local tables and sent to every shard of the cluster for execution; the intermediate results from each shard are then collected at the query node and merged. When the intermediate results are large, for example with countDistinct or windowFunnel, collecting and merging the data at the query node can become the performance bottleneck of the whole query.

The idea of query pushdown is to push the computation down to each shard as much as possible, so that the query node only collects and merges a small amount of final results. Not all queries are suitable for pushdown, however; a query can be considered for pushdown optimization when it meets the following two conditions:

The data is already sharded according to the calculation's requirements: for example, UBA data is sharded by user ID, so for a user funnel analysis, calculations such as UV can be pushed down to each shard. Otherwise, the result after pushdown would be incorrect.

The intermediate results of the calculation are large: calculations such as sum and count do not need pushdown, because their intermediate results are small and merging them is trivial, so pushdown brings no performance gain.

Next, let's take the funnel analysis mentioned above as an example to explain how to do query pushdown.

The figure above shows a SQL query that uses the windowFunnel function to implement funnel analysis. As shown in the "execution steps" in the figure, this query needs to collect a large amount of data from each shard and complete the calculation at the query node, which generates a lot of data transfer and single-point computation.
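A hedged sketch of such a funnel query is shown below (hypothetical table, column, and event names; the figure's exact SQL may differ). The inner GROUP BY buvid over the distributed table is what forces large intermediate states to be shipped to and merged on the query node.

```sql
SELECT level, count() AS uv
FROM
(
    SELECT buvid,
           windowFunnel(3600)(
               log_time,
               event_id = 'main.homepage.show',
               event_id = 'video-detail.player.start',
               event_id = 'video-detail.share.click'
           ) AS level
    FROM polaris.ubt_detail_all          -- distributed table
    WHERE log_date = '2022-12-01'
    GROUP BY buvid
)
GROUP BY level
ORDER BY level;
```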

We first use the configuration distributed_group_by_no_merge to do a pushdown optimization:

Optimized SQL-V1 pushes the windowFunnel computation down to each shard, and only the final windowFunnel results are aggregated at the query node. In our scenario, this version improved performance by more than 5x over the previous one.

To push the query down further, we use the combination of the cluster and view table functions to also push down the aggregation:

Optimized SQL-V2 improves performance by a further 30%+ compared with optimized SQL-V1.
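A hedged sketch of this cluster + view style of pushdown, using the same hypothetical names as above: the whole per-user windowFunnel, including GROUP BY buvid, runs inside the view() on every shard, and the query node only merges a handful of (level, count) rows. This is correct only because the data is sharded by buvid.

```sql
SELECT level, sum(cnt) AS uv
FROM cluster('polaris_cluster', view(
    SELECT level, count() AS cnt
    FROM
    (
        SELECT buvid,
               windowFunnel(3600)(
                   log_time,
                   event_id = 'main.homepage.show',
                   event_id = 'video-detail.player.start',
                   event_id = 'video-detail.share.click'
               ) AS level
        FROM polaris.ubt_detail_local    -- local table on each shard
        WHERE log_date = '2022-12-01'
        GROUP BY buvid
    )
    GROUP BY level
))
GROUP BY level
ORDER BY level;
```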

2. Data-skipping index support for Array and Map

Event data in the UBA scenario has many public and private attributes. Public attributes are designed as fixed table columns, while private attributes, which differ per event, are stored in Array/Map. The original design used two Arrays to store attribute names and attribute values; after ClickHouse added the Map type, later modules used Map for similar needs. Initially neither Array nor Map supported data-skipping indexes, so when the other indexed columns had limited filtering power, operations on Array and Map could become the query's performance bottleneck.

To address this, we added Bloom filter and other skipping-index support for Array and Map, building indexes only on Map keys. In some scenarios that filter on low-frequency private attributes, the Array/Map skipping index yields several-fold performance improvements.
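For illustration, hedged examples of adding Bloom-filter skipping indexes are shown below (hypothetical table and column names); note that the text says parts of this support were patched in-house, so behaviour on stock ClickHouse depends on the server version.

```sql
ALTER TABLE polaris.ubt_detail_local
    ADD INDEX idx_private_keys mapKeys(private_params) TYPE bloom_filter(0.01) GRANULARITY 4;

ALTER TABLE polaris.ubt_detail_local
    ADD INDEX idx_private_names private_param_names TYPE bloom_filter(0.01) GRANULARITY 4;  -- Array(String)

-- A filter such as the following can then skip granules that do not contain the key:
-- WHERE has(mapKeys(private_params), 'share_channel')
```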

3. Compression algorithm optimization

ClickHouse commonly uses three data compression methods: LZ4, LZ4HC, and ZSTD. For different data types, choosing a specific encoding suited to the data distribution can greatly improve the compression ratio and reduce storage costs.

For the UBA scenario we tested the compression ratio, write performance, and query performance of the different algorithms. Compared with the default LZ4, ZSTD(1) generally saves more than 30% of storage, with no obvious difference in query performance, but write performance drops by about 20% in some scenarios. Given the high storage pressure of UBA data and its relatively low timeliness requirements, we finally chose ZSTD(1) as the main compression method.
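A hedged example of switching a column to ZSTD(1) (hypothetical table and column): newly written parts use the new codec, while old parts are only rewritten when merged, or forcibly via OPTIMIZE ... FINAL, which is expensive.

```sql
ALTER TABLE polaris.ubt_detail_local
    MODIFY COLUMN msg String CODEC(ZSTD(1));
```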

9. Next steps

1. Multi-service general model support

The generalized form of the UBA scenario is essentially people + content + behavior; for example, a user may post a bullet comment or like a video while watching. Such data differs from traditional SDK log data, which has a common tracking format, but by abstracting and mapping it onto the general behavior aggregation model we can perform behavior analysis on server-side logs as well. We are currently generalizing support for community server logs and other non-standard business SDK logs, reusing existing capabilities as much as possible to improve users' query and analysis efficiency.

2. Enhanced ClickHouse support for multi-dimensional filtering scenarios

In the UBA scenario, the same table may serve multiple modules. For example, user behavior event data is used by analysis modules such as event analysis and also by the single-user behavior detail query. These two scenarios filter the table on different dimensions, but ClickHouse's current primary-key index has difficulty filtering well on multiple dimensions at the same time, so it is hard to satisfy the query performance requirements of all scenarios simultaneously. We have completed the development of a Z-order index and are now developing the corresponding encoding type, so that UBA data can use the Z-order index to support efficient queries on multiple dimensions at once.


Origin blog.csdn.net/qq_36130719/article/details/130944034