[Meet Doris] Practice of Xiaomi Growth Analysis Platform Based on Apache Doris

The following article is from Xiaomi Technology, by Cai Conghui & Zhong Yun: https://mp.weixin.qq.com/s/wo6gj1JvrFrm3t-KgGoIkg


1 Background

As Xiaomi's Internet business has grown, each product line's need to use user behavior data for growth analysis has become increasingly urgent. Having every product line build its own growth analysis system would clearly be both costly and inefficient, so we wanted a product that shields business users from the underlying technical complexity and lets them focus on their own domains, improving work efficiency. Our investigation found that Xiaomi's existing statistics platform could not support flexible cross-dimensional queries, that query and analysis efficiency was low, that complex queries depended on R&D staff, and that there was no efficient tool for grouping users by behavior, all of which constrained operations teams: with such weak and coarse-grained tooling, operational efficiency was low and results were poor. Based on these needs and pain points, Xiaomi's Big Data and Cloud Platform teams jointly developed the Growth Analytics (GA) system, which aims to provide a flexible, multi-dimensional, real-time query and analysis platform with unified data access and query solutions, helping business lines carry out refined operations.

2 Introduction to Growth Analysis Scenarios

As shown in the figure above, analysis, decision-making, and execution form an iterative loop, so growth analysis queries are highly flexible. The analysis may involve dozens or hundreds of dimensions; we cannot pre-define all the results to be computed, since the cost would be far too high, which in turn requires all data to be computed and analyzed in real time. Decisions are also time-sensitive, so the delay from data ingestion to queryability must be low. In addition, the business evolves quickly and new analysis dimensions are constantly needed, so we must support schema changes (mainly adding fields online). In our business, the three most commonly used growth analysis functions are event analysis (the vast majority), retention analysis, and funnel analysis. All three require detail-level data to be stored in real time (append-only), queried with ad hoc combinations of dimensions and conditions (often joined with a business profile table or a selected user segment), and answered within seconds; comparable industry products such as Sensors Data and GrowingIO achieve this level of performance. Pre-aggregation engines such as Kylin offer excellent query performance but struggle to support schema changes at any time, and a large number of dimensions can make cube storage grow out of control. Hive meets the functional requirements but its performance is poor. In summary, we need to store and compute detail-level data, with a system that supports near-real-time ingestion, flexible schema changes, and ad hoc queries.

3 Technology Architecture Evolution

3.1 Initial Architecture

The GA project was established in mid-2018. At the time, considering development time, cost, and technology stack, we reused existing big data components (HDFS, Kudu, SparkSQL, etc.) and built a growth analysis query system on a Lambda architecture. The architecture of the first-generation GA system is shown in the following figure:

The GA system covers the complete pipeline of data collection, data cleaning, data query, and BI report display. First, data collected from each source is cleaned uniformly and written to Talos (note: Xiaomi's self-developed message queue) in a uniform JSON format; Spark Streaming then dumps the data into Kudu. As an excellent OLAP storage engine, Kudu supports both real-time ingestion and fast queries, so Kudu serves as hot-data storage while HDFS serves as cold-data storage. To keep users from having to perceive the hot/cold split, a dynamic partition management service handles the migration of table partitions, regularly converting expired hot data into cold data on HDFS and updating the view over the Kudu and HDFS tables. When a user queries the view through the SparkSQL service, the compute engine automatically routes the query SQL to both the data in the Kudu table and the data in the HDFS table.

In the context of the time, the first-generation GA helped our users relieve the pain points of coarse operational strategy and low operational efficiency, but it also exposed some problems. The first was operation and maintenance cost. The original design had each component share public-cluster resources, but in practice query performance was easily disturbed by other jobs on the public cluster and tended to jitter (reading data from the public HDFS cluster, in particular, was sometimes slow), so we ended up building dedicated storage-layer and compute-layer components for the GA cluster. The other issue was performance. SparkSQL is a query engine designed on top of a batch processing system; when exchanging data between stages, the shuffle still spills to disk, so SQL query latency is relatively high. To keep queries from being starved of resources, we added machines to sustain query performance, but in practice the headroom for improvement was limited: the solution could not make full use of machine resources for efficient queries, and some resources were wasted. We therefore wanted a new solution that would both improve query performance and reduce our operation and maintenance costs.

3.2 Re-selection

MPP-architecture SQL query engines such as Impala and Presto can execute SQL efficiently, but they still depend on components such as Kudu, HDFS, and the Hive Metastore, so operation and maintenance costs remain high. Moreover, because compute and storage are separated, the query engine cannot perceive data changes in the storage layer in a timely manner and so cannot perform finer-grained query optimization; caching at the SQL layer, for example, cannot guarantee that results are up to date. Our goal was therefore an MPP database that integrates compute and storage, replacing our existing storage- and compute-layer components. We had the following requirements for this MPP database:

  1. Query performance fast enough.
  2. Comprehensive support for standard SQL, user-friendly.
  3. No dependence on other external systems; easy to operate and maintain.
  4. An active community, to ease our subsequent maintenance and upgrades.

Doris is an MPP-based interactive SQL data warehouse, open-sourced by Baidu to the Apache community and mainly used for reporting and multi-dimensional analysis. It primarily integrates the technology of Google Mesa and Cloudera Impala, and it meets the requirements above. We ran internal performance tests on Doris and communicated with the community, then settled on replacing the original compute and storage components with Doris, simplifying our architecture as shown in the following figure:

3.3 Performance Test

With roughly the same compute resources, we selected a business with an average daily data volume of about 1 billion records and compared the query performance of SparkSQL and Doris across different scenarios (6 event analyses, 3 retention analyses, and 3 funnel analyses) and different time ranges (one week to one month).

As the test results above show, in the growth analysis scenario Doris's query performance is a significant improvement over the SparkSQL + Kudu + HDFS solution: in event analysis, query time drops by about 85% on average; in retention and funnel analysis, by about 50%. Since most of our business consists of event analysis, this is a substantial gain.

4 Doris practice and optimization

4.1 Usage of Doris in Growth Analysis Platform

As more businesses came on board, our growth analysis cluster grew to nearly 100 nodes at its peak, with petabyte-scale stored data. It hosts dozens of near-real-time product-line jobs, ingests tens of billions of records per day, and serves 12,000+ effective business query SQLs daily. Business growth and cluster growth brought many problems and challenges; below we describe some of the problems encountered while operating the Doris cluster, and our countermeasures or improvements.

4.2 Doris data import practice

The first challenge in bringing business onto Doris at scale was data import. Our businesses require data to be imported as close to real time as possible. The growth analysis cluster currently has dozens of business detail tables that need near-real-time import, including several large businesses (billions to tens of billions of records per day, with 200~400 fields). To ensure data is not inserted twice, Doris marks each imported batch with a label and uses a two-phase commit to guarantee import transactionality: either the whole batch succeeds or it all fails. To make import jobs easy to monitor and manage, we wrap the stream load operation with Spark Streaming to import Talos data into Doris. Every few minutes, Spark Streaming reads a batch of data from Talos into an RDD whose partitions map one-to-one to Talos partitions, as shown in the following figure:

In Doris, each stream load job creates a transaction, and the FE master node manages the entire transaction life cycle; submitting too many transactions in a short time puts heavy pressure on the FE master. For a single streaming import job, if the message queue has m partitions in total and each partition performs at most n stream load operations per batch, then processing one batch can generate up to m*n transactions. To keep Doris imports stable, we adjusted the Spark Streaming batch interval from 1 min to 3 min according to each business's data volume and latency requirements, and increased the amount of data sent per stream load as much as possible.
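The import path described above can be sketched in a few lines. This is a minimal illustration only: the host, port, table names, and label scheme are assumptions, and the request is built but never sent. The `label` header is what makes a batch idempotent, and the `m*n` arithmetic shows why stretching the batch interval lowers the FE's transaction rate.

```python
import urllib.request

def stream_load_request(fe_host, db, table, label, payload):
    """Build (but do not send) a Doris stream load HTTP request.

    Stream load is a PUT against the HTTP port; the `label` header makes
    the import idempotent: resubmitting the same label is rejected, so a
    replayed batch is never inserted twice."""
    url = "http://%s:8030/api/%s/%s/_stream_load" % (fe_host, db, table)
    req = urllib.request.Request(url, data=payload.encode(), method="PUT")
    req.add_header("label", label)            # unique per batch, e.g. table-partition-batch
    req.add_header("column_separator", ",")
    return req

def txns_per_batch(m_partitions, n_loads_per_partition):
    """Each batch can generate up to m*n FE transactions, one per stream load."""
    return m_partitions * n_loads_per_partition

# Illustrative numbers (not from the article): 32 Talos partitions and
# 2 stream loads per partition give 64 transactions per batch; going
# from 1-minute to 3-minute batches cuts the FE's rate to ~21/min.
req = stream_load_request("fe.example.com", "ga", "event_log",
                          "event_log-p0-b42", "u1,click\nu2,view\n")
print(req.get_method(), txns_per_batch(32, 2))  # PUT 64
```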

In the early stage of onboarding, this streaming import mechanism ran smoothly, but problems followed as the scale of onboarded businesses grew. First, some large tables holding many days of data frequently failed to import, manifesting as import-timeout errors. Our investigation found the cause: when stream-loading, we did not specify the target partitions of the table (the online event tables are all partitioned by day); some event tables retained more than three months of data with 600+ data shards per day, and each table keeps three replicas by default, so roughly 180,000 writers had to be opened before each write. Merely opening the writers already exceeded the timeout, yet since the data is imported in real time, nothing is written to other days' partitions, so most of those writers never needed to be opened at all. After locating the cause, we took two measures: first, specify the target partition at import time according to each record's date; second, reduce the number of shards per day-partition from 600+ to 200+ (too many shards hurt both import and query efficiency). With write partitions specified and the shard count bounded, even large tables now import smoothly and stably without timing out.
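The first of the two measures can be sketched as a tiny helper that derives the target partition from the record date. `partitions` and `columns` are real stream load headers, but the day-partition naming scheme (`p20190801`-style) is an assumption for illustration, not necessarily what Xiaomi's tables use.

```python
from datetime import date

def stream_load_headers(event_day: date, columns: str) -> dict:
    """Extra stream load headers for one day's data in a day-partitioned table.

    Naming the target partition means Doris opens writers only for that
    partition's tablets instead of for every partition of the table, which
    is what caused the open-writer timeouts described above."""
    return {
        "partitions": "p" + event_day.strftime("%Y%m%d"),  # assumed naming scheme
        "columns": columns,
    }

h = stream_load_headers(date(2019, 8, 1), "uid,event,ts")
print(h["partitions"])  # p20190801
```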

Another problem that troubled us was that the growing number of businesses needing real-time import put increasing pressure on the FE master node and also hurt import efficiency. For each stream load operation, the coordinator BE node must interact with the FE node multiple times, as shown in the following figure:

For a period of time, we found that the number of threads on the FE master would occasionally soar, CPU load would rise, and finally the process would hang and restart. Our query concurrency is not high, so queries were unlikely to be the cause; moreover, we had already limited imports on the FE side with the max_running_txn_num_per_db parameter, so the soaring thread count puzzled us. The logs showed a large number of BE-side failures to fetch the import execution plan. We did limit the maximum number of concurrent transactions per db, but because the FE must acquire the db's read lock to compute an execution plan, and its write lock to commit and finish a transaction, a few long-tail tasks caused many plan-computation tasks to block on the db lock. Meanwhile the BE client, seeing its rpc request time out, retried immediately; the thrift server on the FE side had to start a new thread for each new request while the earlier tasks were never cancelled, so the backlog kept growing and eventually avalanched. In response, we made the following modifications to Doris:

  1. When constructing the thread pool of FE's thrift server, create the pool explicitly instead of using the native newCachedThreadPool, limit the number of threads accordingly to avoid resource exhaustion from a soaring thread count, and add corresponding monitoring.
  2. When a BE's rpc request to the FE times out, it is usually because the FE could not process the request within the allotted time, so add buffer time before retrying to avoid worsening the backlog of FE-side requests.
  3. Refactor the GlobalTransactionMgr code: while keeping the original interfaces compatible, support db-level transaction isolation, minimize interference between different transaction requests, and optimize some transaction-processing logic to speed up transaction handling.
  4. Add a timeout to db lock acquisition: if the lock cannot be acquired within the allotted time, cancel the task, since by then the BE-side rpc request has already timed out and continuing to execute the cancelled task is pointless.
  5. Add metrics for each time-consuming step on the coordinator BE, such as the time to start a transaction and the time to fetch the execution plan, and return them in the final execution result, so that we can see at a glance how each stream load spends its time.

After these changes, the stability of our data import improved markedly; the FE master has not hung under the pressure of import transactions since. However, some import problems still need improvement:

  1. The BE side uses libevent to handle http requests. Reactor-style libevent is normally the first choice for high-performance network servers, but it does not fit our scenario: Doris calls blocking business logic, such as rpc requests and waiting for data distribution to finish, inside the callbacks, and because multiple requests share one thread, some callbacks cannot be processed in time. We have no good solution for this yet; the only mitigation is to increase libevent's thread concurrency to reduce interference between requests. A complete fix still needs discussion with the community.
  2. The FE side uses db-level isolation when updating a table's partition version. This lock granularity is too coarse: imports into different tables of the same db contend for the db lock, which greatly reduces the FE's transaction-processing efficiency.
  3. The publish-transaction step is still prone to publish timeouts (meaning most of the BE nodes involved in a transaction cannot respond to the publish operation within the allotted time), which remains a major obstacle to improving import efficiency.
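Two of the modifications above, the bounded thread pool (1) and the db-lock timeout (4), can be sketched together. Doris's FE is written in Java; the Python below is only an analogy, with `ThreadPoolExecutor(max_workers=...)` standing in for an explicitly sized pool and `Lock.acquire(timeout=...)` standing in for the db-lock timeout.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

db_lock = threading.Lock()  # stand-in for the per-db lock on the FE

def plan_with_lock_timeout(task_id, timeout_s=1.0):
    """Mirror of modification 4: give up if the db lock cannot be taken in
    time, since the BE's rpc has already timed out by then and the result
    would be discarded anyway."""
    if not db_lock.acquire(timeout=timeout_s):
        return (task_id, "cancelled")
    try:
        return (task_id, "planned")   # compute the execution plan here
    finally:
        db_lock.release()

# Mirror of modification 1: an explicitly bounded pool instead of an
# unbounded cached pool, so a burst of retries cannot exhaust threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(plan_with_lock_timeout, range(8)))
print(results[0])  # (0, 'planned')

# If a long-tail task is holding the lock, new tasks cancel instead of piling up:
db_lock.acquire()
print(plan_with_lock_timeout("late", timeout_s=0.01))  # ('late', 'cancelled')
db_lock.release()
```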

4.3 Doris Online Query Practice

In the growth analysis business scenario, the event tables are our core tables, and detail logs must be imported into them in real time. These event tables have no aggregation or deduplication requirements, and the business needs to query the detail rows, so the duplicate model (DUPLICATE KEY) is used. The event tables are partitioned by day, and the bucketing column is the log id field (actually a randomly generated md5), whose hash spreads data evenly across buckets and avoids the write and query problems caused by data skew. The following figure shows query performance statistics for our largest online cluster over the last 30 days (taken from Doris's query audit log); the number of successful SQL queries per day in the last week was between 12,000 and 20,000.

As the figure shows, after switching to Doris the average query time stays around 10 seconds and never exceeds 15 seconds, while P95 query time can generally be kept within 30 seconds, a clear improvement over the original SparkSQL experience. Doris provides the query concurrency parameter parallel_fragment_exec_instance_num; our query server adjusts it dynamically according to the number of running tasks, raising concurrency under low load to improve performance and lowering it under high load to protect cluster stability. When analyzing business query profiles, we noticed that by default Doris uses the same concurrency before and after an exchange. For aggregation queries, however, the data volume after the exchange is usually greatly reduced; keeping the same concurrency not only wastes resources, but executing a much smaller data volume with high concurrency can in theory even reduce query performance. We therefore added a parameter, doris_exchange_instances, to control task concurrency after the exchange (as shown in the figure below), and it produced good results in real business tests.
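The intuition behind the parameter can be sketched as a simple capping rule. This is a guess at the shape of the behavior for illustration only, not Doris's actual implementation; the numbers and the "non-positive means disabled" convention are assumptions.

```python
def post_exchange_instances(pre_exchange_instances, doris_exchange_instances):
    """Cap the number of execution instances after an exchange.

    If the knob is disabled (assumed here: non-positive), post-exchange
    concurrency equals pre-exchange concurrency; a positive value caps it,
    so a heavily reduced post-aggregation dataset is no longer spread
    across many near-empty instances."""
    if doris_exchange_instances <= 0:
        return pre_exchange_instances
    return min(pre_exchange_instances, doris_exchange_instances)

# A query scanning 64 buckets runs 64 pre-exchange instances; after an
# aggregating exchange, capping at 4 avoids scheduling 64 tiny tasks.
print(post_exchange_instances(64, 0), post_exchange_instances(64, 4))  # 64 4
```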

The effect is not obvious for businesses with huge data volumes, or for queries where the exchange does not significantly reduce the data volume; it is more pronounced for aggregation or join queries of small and medium businesses (especially those using more buckets). Our tests on businesses of different orders of magnitude confirmed this. We selected a small business with a data volume of 400 million records per day and tested query performance in different scenarios:

As the results above show, doris_exchange_instances brings a clear improvement for small aggregation and join queries. Of course, this test used the optimal doris_exchange_instances value found through repeated trials; finding the optimal value for every query is impractical in production. In general, for small and medium businesses, it is enough to reduce the parameter appropriately based on the number of buckets the query plan must scan and the size of the cluster, gaining a solid performance improvement at small cost. We later contributed this improvement to the community, where the parameter was renamed parallel_exchange_instance_num.

To extend SQL's query capabilities, Doris also provides a UDF (User-Defined Function) framework similar to those of SparkSQL and Hive: when Doris's built-in functions do not meet a user's needs, the user can implement custom functions against the UDF framework. Doris supports two kinds of UDF (UDTFs, User-Defined Table-Generating Functions, where one input row produces multiple output rows, are not yet supported): ordinary UDFs, which produce one output row from one input row, and UDAFs (User-Defined Aggregate Functions), which aggregate multiple input rows into one output row. The execution flow of a UDAF is shown in the following figure:

A UDAF generally needs to define four functions: Init, Update, Merge, and Finalize. If the intermediate state is a complex data type, a Serialize function must also be implemented to serialize the intermediate state during the shuffle, with deserialization handled in the Merge function. In growth analysis scenarios, retention analysis and funnel analysis both rely on UDAFs. Take retention analysis as an example: it is a model for analyzing user engagement/activity, examining how many users who performed an initial action perform a follow-up action later. For this we first define a function retention_info, whose input is each user's behavior records; grouping by each user's id, it produces that user's retention information for each time unit (day, week, month, etc.) within the specified period. We then define a function retention_count, whose input is the per-user retention information produced by retention_info; grouping by the retention time unit (usually day here), it counts the number of retained users per unit of time. In this way, with the help of UDAFs, we complete the retention analysis computation.
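The Init/Update/Merge/Finalize lifecycle can be illustrated with a plain-Python mock of a retention_count-style aggregate. This is not Doris's UDAF API (which is implemented in C++ against the framework above); the state layout and the per-user input shape (a set of day offsets on which the user returned) are assumptions for the sketch.

```python
def retention_init():
    # Init: empty aggregation state, mapping day offset -> retained user count.
    return {}

def retention_update(state, user_retention_info):
    # Update: fold one user's retention info (the per-user output of a
    # retention_info-style function) into the state.
    for day in user_retention_info:
        state[day] = state.get(day, 0) + 1
    return state

def retention_merge(a, b):
    # Merge: combine partial states produced on different nodes after the shuffle.
    for day, n in b.items():
        a[day] = a.get(day, 0) + n
    return a

def retention_finalize(state):
    # Finalize: emit (day offset, retained user count) pairs in order.
    return sorted(state.items())

# Two nodes each aggregate some users, then their states are merged.
s1 = retention_update(retention_update(retention_init(), {0, 1}), {0, 3})
s2 = retention_update(retention_init(), {0, 1, 3})
print(retention_finalize(retention_merge(s1, s2)))
# [(0, 3), (1, 2), (3, 2)]
```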

4.4 Management of Doris tables

In our growth analysis scenario, Doris's olap tables fall into two types from a partitioning perspective. The first is non-partitioned tables, such as user-segment (crowd package) tables and business profile tables: segment tables each hold little data but are numerous, while profile tables hold a moderate amount of data and have update requirements. The second is partitioned tables, such as the event tables, where a single table is usually large. In the design we use the time field as the partition key, so a new partition must be added to each table every day for real-time data to land in the current day's partition, and expired partitions must be deleted promptly. Leaving each business to manage its own table partitions would clearly be both tedious and error-prone. Our original GA architecture included a dynamic partition management service; after adopting Doris, we integrated that service into the Doris system, letting users set, by day, week, or month, the number of partitions to retain and the number of partitions to create in advance.

Another typical table-management scenario is modifying a table's schema, mainly adding columns. At this stage Doris supports only basic data types, while in big data scenarios the logs reported by businesses are mostly nested types (list, map), so they must be flattened or converted when loaded into Doris. As a result the Doris tables end up with a huge number of columns, and some hard-to-flatten fields have to be stored as varchar, which is inconvenient to use and relatively slow to query. Because Doris does not support nested data types, whenever an element is added to a nested type the Doris table must add a column, and it takes a long time from submitting the add-column request to its completion; when the data volume and tablet count are large, adding a column may even fail. We made two main improvements for these problems:

  1. Shorten the wait between a user submitting a schema-change request and its actual execution. In the original design, when the system creates a transaction to modify a table's schema, it waits for all of the same db's transactions with smaller transaction ids to finish before the change can start; we modified it to wait only for the transactions related to that table with smaller transaction ids. When a db has many data import jobs, this greatly shortens the wait for a schema change, and it also prevents import failures on other tables from delaying the schema-change operation.
  2. Speed up the creation of tablets carrying the new schema. Doris performs a schema change by creating tablets with the new schema and then migrating the old tablets' data into them. A BE node manages all tablets on the node through a single map data structure guarded by one global lock; when the tablet count is very large, every tablet-management operation must take the global lock, which can cause new-tablet creation to time out and the schema change to fail. We therefore sharded the map and the global lock to avoid these creation timeouts.
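The sharding idea in improvement 2 can be sketched as follows. Doris's BE is C++; this Python mock only illustrates the structure (N shards, each with its own map and lock, selected by hashing the tablet id), with illustrative names throughout.

```python
import threading

class ShardedTabletMap:
    """Sketch of improvement 2: split one global tablet map and its single
    global lock into N shards so concurrent tablet operations rarely contend."""

    def __init__(self, num_shards=16):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, tablet_id):
        # Hash of the tablet id selects the shard holding that tablet.
        return self._shards[hash(tablet_id) % len(self._shards)]

    def put(self, tablet_id, tablet_meta):
        data, lock = self._shard(tablet_id)
        with lock:                 # only this shard is locked, not the whole map
            data[tablet_id] = tablet_meta

    def get(self, tablet_id):
        data, lock = self._shard(tablet_id)
        with lock:
            return data.get(tablet_id)

m = ShardedTabletMap()
m.put(10001, "tablet-meta")
print(m.get(10001))  # tablet-meta
```

Two operations on tablets in different shards now proceed in parallel, so creating tablets for a new schema no longer waits behind unrelated tablet management work.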

5 Summary and Outlook

Since Doris's first business launch at Xiaomi in September 2019, we have deployed nearly ten clusters at home and abroad (hundreds of BEs in total), serving tens of thousands of online analysis queries per day and covering most online analysis needs, including growth analysis and report queries. From the results, replacing SparkSQL with Doris as the main OLAP engine not only greatly improved query performance but also simplified the overall data analysis architecture. It is a fairly successful case of running Doris at scale on detail-data queries. In the coming period we will continue to invest in improving real-time import efficiency and overall query performance. Since many businesses in the company need the UNIQUE KEY model, whose scan performance still lags noticeably behind the DUPLICATE KEY model, that is a performance problem we need to focus on next.

6 About the Author

Cai Conghui, Xiaomi OLAP Engineer, Apache Doris Committer

Zhong Yun, Xiaomi Big Data Engineer


About Apache Doris (Incubating)

Apache Doris (Incubating) is an interactive SQL analysis database based on massively parallel processing (MPP) technology. Baidu contributed it to the Apache Foundation in 2018, and it is currently in the Apache Incubator.

Doris official website: http://doris.incubator.apache.org/master/zh-CN/
Doris Github: https://github.com/apache/incubator-doris
Doris Gitee mirror: https://gitee.com/baidu/apache-doris
Doris developer mail group: [How to subscribe]
Doris WeChat public account:
