Tens of Billions of New Records per Day with Second-Level Query Response: Apache Doris as the Unified OLAP Engine in 360's Commercialization Business

Overview: To help its business teams better drive commercial growth, the real-time data warehouse of 360's commercialization business has gone through three architectural generations: Storm + Druid + MySQL, Flink + Druid + TiDB, and Flink + Doris. The successful rollout of the new-generation architecture built on Apache Doris allowed the 360 commercialization team to unify its real-time data warehouse on a single OLAP engine and achieve second-level query response across a wide range of real-time scenarios. This article walks through that evolution and the concrete implementation of the new-generation real-time data warehouse in advertising business scenarios.

Authors | Dou Heyu and Wang Xinxin, 360 Commercialization Data Team

360 is committed to being a provider of Internet and security services and an advocate of free Internet security.

360's commercialization business builds on the huge user reach and strong user stickiness of 360's products, applying professional data processing and algorithms to deliver precisely targeted advertising and help hundreds of thousands of small and medium-sized enterprises and key accounts (KA) grow. The 360 commercialization data team computes and processes the data generated across the entire ad-delivery chain: it provides the product and operations teams with analytical data for strategy adjustment, the algorithm team with data for model training and optimization, and advertisers with ad performance data.

Business Scenarios

Before introducing how Apache Doris is applied in 360's commercialization business, let's briefly look at the typical scenarios in the advertising business:

  • Real-time dashboard: The real-time dashboard is the key vehicle through which we present data externally. It monitors commercial metrics across multiple dimensions, including traffic, spend, conversion, and monetization metrics. Data accuracy requirements are therefore very high (no lost or duplicated data), and so are the timeliness and stability requirements: second-level latency and minute-level data recovery.
  • Real-time account spend: By monitoring multi-dimensional metrics at account granularity, spend changes can be detected promptly, so the product team can push the operations team to adjust account budgets based on real-time spend. If the data is wrong in this scenario, budgets may be adjusted incorrectly, affecting ad delivery and causing immeasurable losses to the company and to advertisers, so accuracy requirements are equally high. The current difficulty here is achieving second-level query response given a relatively large data volume and relatively fine query granularity.
  • AB experiment platform: In the advertising business, algorithm and strategy engineers run experiments against different scenarios. Report dimensions are not fixed, dimensions are combined flexibly, the analysis is complex, and the data volume is large, so write performance into the storage engine must hold up under million-level QPS. We therefore need model designs and processing optimizations tailored to the business scenario to improve real-time processing performance and query efficiency; only then can we meet the algorithm and strategy engineers' needs for querying and analyzing experiment reports.

Evolution of the Real-Time Data Warehouse

To improve the efficiency of data services in each scenario and help the business teams better drive commercial growth, the real-time data warehouse has so far gone through three generations: Storm + Druid + MySQL, Flink + Druid + TiDB, and Flink + Doris. The rest of this article details that evolution and the concrete implementation of the new-generation real-time data warehouse in advertising business scenarios.

First-Generation Architecture

The real-time data warehouse at this stage was built on Storm + Druid + MySQL, with Storm as the real-time processing engine. After processing in Storm, the data was written into Druid, where Druid's pre-aggregation capability rolled up the written data.

Architecture pain points:

Initially, we tried to solve all real-time business problems with this architecture, serving data queries from Druid alone. In practice, however, we found that Druid could not handle certain pagination and Join scenarios. To work around this, we periodically exported data from Druid into MySQL via scheduled tasks (effectively using MySQL as a materialized view of Druid), and then served queries through the combined Druid + MySQL setup. This stopgap met the needs of some scenarios, but as the business scaled and the data volumes to query and analyze grew, the architecture became unsustainable and its defects more and more apparent:

  • Facing continuously growing data volumes, the pressure on the warehouse rose sharply and it could no longer meet the timeliness requirements of real-time data.
  • MySQL sharding (splitting databases and tables) was difficult and costly to maintain, and data consistency across MySQL tables could not be guaranteed.

Second-Generation Architecture

To address the problems of the first architecture, we carried out the first upgrade. The main change was replacing Storm with Flink as the real-time processing engine. Compared with Storm, Flink offers richer semantics and functionality and guarantees data consistency, which greatly improved report timeliness. Second, we replaced MySQL with TiDB, whose distributed design alleviated the maintenance burden of MySQL sharding to some extent (TiDB can carry larger data volumes than MySQL and requires splitting far fewer tables). After the upgrade, depending on the business scenario, data processed by Flink was written into Druid and TiDB respectively, and both engines served queries externally.

Architecture pain points:

Although this stage effectively improved data timeliness and reduced the burden of maintaining MySQL shards, new problems surfaced after a period of use, forcing a second upgrade:

  • Flink + TiDB could not achieve end-to-end consistency: with large data volumes, enabling transactions severely degrades TiDB's write performance, so in this scenario TiDB's transactions were effectively unusable.
  • Druid does not support standard SQL, which raises the barrier to entry. It was very inconvenient for the relevant teams to work with the data, directly hurting productivity.
  • Maintenance costs were high: two engines and two sets of query logic had to be maintained, greatly increasing the investment in maintenance and development.

The New-Generation Real-Time Data Warehouse Architecture

For the second upgrade, we introduced Apache Doris and, together with Flink, built the new-generation real-time data warehouse. We applied the layering concepts of offline data warehouses to structure the real-time warehouse and unified on Apache Doris as the OLAP engine, with Doris serving all queries.

Our data comes mainly from dimension-table material data and business event logs. The dimension-table material data is fully synchronized to Redis or Aerospike (a Redis-like KV store) at regular intervals, with incremental synchronization driven by binlog changes. The business data is collected by the various teams into Kafka as the ODS layer (raw data with no processing applied). At the ODS layer we normalize the data, unifying field naming and field types and dropping invalid fields, and generate DWD-layer data according to the business scenario. The DWD-layer data is then processed with business logic and joined with the dimension data in Redis, or combined through multi-stream Joins, to produce one wide table per business, the DWT layer. We aggregate the DWT-layer data and write it into Doris via Stream Load, and Doris serves queries externally. On the offline side, some scenarios also require loading processed DWS data into the Doris cluster daily via Broker Load, using Doris to accelerate queries and improve the efficiency of our services.
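To make the write path concrete, here is a minimal sketch of a Stream Load call in Java using Apache HttpClient 4.x; the host, HTTP port, database, table, credentials, and sample rows are illustrative assumptions, not our production values. Stream Load is an HTTP PUT against the FE (which redirects the request to a BE), and the unique label makes retried loads idempotent:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.UUID;

import org.apache.http.HttpHeaders;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class StreamLoadDemo {
    public static void main(String[] args) throws Exception {
        // Placeholders: FE host/HTTP port, database, table, credentials, payload.
        String url = "http://doris-fe:8030/api/ad_dw/dwt_ad_report/_stream_load";
        String rows = "[{\"dt\":\"2023-04-01\",\"account_id\":42,\"cost\":1000}]";

        // The FE answers a Stream Load with a 307 redirect to a BE, so the
        // client must be willing to follow redirects for PUT as well.
        try (CloseableHttpClient client = HttpClients.custom()
                .setRedirectStrategy(new DefaultRedirectStrategy() {
                    @Override
                    protected boolean isRedirectable(String method) {
                        return true;
                    }
                }).build()) {

            HttpPut put = new HttpPut(url);
            put.setHeader(HttpHeaders.EXPECT, "100-continue");
            put.setHeader(HttpHeaders.AUTHORIZATION, "Basic " + Base64.getEncoder()
                    .encodeToString("loader:passwd".getBytes(StandardCharsets.UTF_8)));
            // A unique label makes retries idempotent: one label loads at most once.
            put.setHeader("label", "dwt_ad_report_" + UUID.randomUUID());
            put.setHeader("format", "json");
            put.setHeader("strip_outer_array", "true");
            put.setEntity(new StringEntity(rows, StandardCharsets.UTF_8));

            try (CloseableHttpResponse resp = client.execute(put)) {
                // The JSON body reports Status (Success, Label Already Exists, ...).
                System.out.println(EntityUtils.toString(resp.getEntity()));
            }
        }
    }
}
```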

Why We Chose Doris

Building on Apache Doris's strengths in performance, ease of use, and real-time unification, 360's commercialization team successfully built its new-generation real-time data warehouse. The upgrade not only improved the reusability of real-time data and unified the OLAP engine, but also met the stringent query and analysis requirements of the major business scenarios, keeping the overall real-time architecture simple and greatly reducing maintenance and usage costs. The main reasons we chose Doris as the unified OLAP engine come down to the following points:

  • Materialized views: Doris's materialized views fit the characteristics of advertising scenarios very well. For example, the query dimensions of most advertising reports are relatively fixed, so materialized views can markedly improve query efficiency. At the same time, Doris guarantees consistency between a materialized view and its underlying data, which reduces our maintenance costs.
  • Data consistency: Doris provides the Stream Load label mechanism, which we combine with Flink's two-phase commit to guarantee idempotent writes. On top of that, our self-developed Flink Sink Doris component achieves end-to-end data consistency and ensures data accuracy.
  • SQL protocol compatibility: Doris is compatible with the MySQL protocol and supports standard SQL, so developers, data analysts, and product staff can connect at essentially zero cost (see the JDBC sketch after this list). This saved the company substantial training and adoption costs and improved productivity.
  • Excellent query performance: Apache Doris has a fully vectorized query engine, giving it markedly stronger OLAP performance across query scenarios and dramatically faster reports. Combined with its columnar storage engine, modern MPP architecture, pre-aggregated materialized views, and data indexes, it delivers extremely fast low-latency, high-throughput queries.
  • Low operational burden: Doris automates much of cluster and data-replica management. This makes cluster operations very simple, close to zero-threshold operations and maintenance.
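As a small illustration of that MySQL protocol compatibility, the sketch below queries Doris with the standard MySQL JDBC driver; the host, query port (9030 is the usual FE default), schema, table, and credentials are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DorisJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Doris is addressed like a MySQL server; no Doris-specific driver is needed.
        String url = "jdbc:mysql://doris-fe:9030/ad_dw?useSSL=false";
        String sql = "SELECT dt, SUM(cost) AS total_cost "
                   + "FROM dwt_ad_report GROUP BY dt ORDER BY dt";
        try (Connection conn = DriverManager.getConnection(url, "reader", "passwd");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("dt") + " -> " + rs.getLong("total_cost"));
            }
        }
    }
}
```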

Implementation on the AB Experiment Platform

Apache Doris is now widely used across business scenarios within 360's commercialization. For example, in the real-time dashboard scenario, we use Doris's Aggregate model to join the fact tables of multiple real-time streams such as requests, impressions, clicks, and conversions; we rely on Doris's transaction features to guarantee data consistency; and we aggregate data along report dimensions to speed up queries. Since Doris itself maintains consistency between a materialized view and its Base table, this also greatly reduces usage complexity. In the real-time account spend scenario, we mainly rely on Doris's excellent query optimizer to compute period-over-period comparisons through Joins.
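To make the Aggregate model concrete, below is a hedged sketch of what a dashboard-style table could look like; the schema, names, bucket count, and replication setting are illustrative assumptions, not our production DDL. Value columns declared with SUM are rolled up by Doris as rows sharing the same key arrive, and the DDL can be executed over plain JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateDashboardTable {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://doris-fe:9030/ad_dw?useSSL=false";
        String ddl =
            "CREATE TABLE IF NOT EXISTS dwt_realtime_dashboard (" +
            "  dt           DATE        NOT NULL," +
            "  hr           SMALLINT    NOT NULL," +
            "  dsp          VARCHAR(64) NOT NULL," +
            "  traffic_type VARCHAR(64) NOT NULL," +
            // value columns: Doris sums rows that share the same key columns
            "  requests     BIGINT SUM DEFAULT \"0\"," +
            "  impressions  BIGINT SUM DEFAULT \"0\"," +
            "  clicks       BIGINT SUM DEFAULT \"0\"," +
            "  cost         BIGINT SUM DEFAULT \"0\"" +
            ") AGGREGATE KEY(dt, hr, dsp, traffic_type) " +
            "DISTRIBUTED BY HASH(dsp) BUCKETS 16 " +
            "PROPERTIES (\"replication_num\" = \"3\")";
        try (Connection conn = DriverManager.getConnection(url, "admin", "passwd");
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }
}
```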

Next, taking the AB experiment platform as a representative business scenario, we describe the Doris implementation in detail; the applications in the scenarios above will not be elaborated further.

AB experiments are widely used in advertising scenarios. They are an important tool for measuring how designs, algorithms, models, and strategies improve product metrics, and in the advertising business they are used to analyze and validate data in order to optimize product solutions and improve advertising effectiveness.

As briefly noted at the beginning of the article, the business carried by the AB experiment scenario is relatively complex. In detail:

  • Dimension combinations are very flexible. Analyses may cover more than a dozen dimensions, from DSP to traffic type to ad position, and must complete a full traffic funnel across dozens of metrics such as requests, bids, impressions, clicks, and conversions.
  • The data volume is huge. Daily traffic averages tens of billions of records, peaking at around one million OPS (operations per second), and a single traffic record may carry dozens of experiment tag IDs.

Given these characteristics, the AB experiment scenario requires, on the one hand, fast computation, low data latency, and fast user queries, and on the other hand, data accuracy: no data may be lost or duplicated.

Data Ingestion

When a single traffic record may carry dozens of experiment tag IDs, analysis only ever needs to select one experiment tag and one control tag. But if this were implemented by pattern-matching (e.g., with LIKE) the selected tag against the dozens of tags in each record, execution would be very inefficient.

Initially, we intended to break the experiment tags apart at the point of ingestion: split one traffic record carrying 20 experiment tags into 20 records each carrying a single tag, then import them into Doris's Aggregate model for analysis. This ran into an obvious problem: splitting expands the data dozens of times, turning tens of billions of records into hundreds of billions. Even though the Doris Aggregate model compresses the data again, the process puts enormous pressure on the cluster. We therefore abandoned this approach and tried shifting part of the pressure to the compute engine. Note, however, that if the data is split directly in Flink, then when the job's global hash window merges the data, the records expanded dozens of times also incur dozens of times the network and CPU cost.

We then made a third attempt: perform a local merge immediately after splitting on the Flink side. We open a window in the memory of the same operator and apply a first layer of aggregation to the split data, then apply the second layer of aggregation through the job's global hash window. Because the two chained operators run in the same thread, the network cost of shipping the expanded data between operators is largely eliminated. This two-layer window aggregation, combined with the Doris Aggregate model, effectively limits data expansion; a sketch of the pattern follows below. In parallel, we also pushed the business side to regularly clean up finished experiments to reduce wasted compute.
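The sketch below shows the shape of this two-layer aggregation in Flink's DataStream API; class names, fields, the window size, and the flush threshold are illustrative assumptions rather than our production code. Step 2 is chained to step 1 in the same thread, so the expanded records are shrunk locally before the keyed network shuffle in step 3:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class AbExperimentTwoLayerAgg {

    /** Raw traffic record; one record may carry dozens of experiment tag IDs. */
    public static class AdEvent {
        public long[] expTagIds = new long[0];
        public long requests;
        public long clicks;
    }

    /** Per-experiment-tag metric row (illustrative fields). */
    public static class ExpMetric {
        public long tagId;
        public long requests;
        public long clicks;
        public ExpMetric() {}
        public ExpMetric(long tagId, long requests, long clicks) {
            this.tagId = tagId; this.requests = requests; this.clicks = clicks;
        }
        public ExpMetric merge(ExpMetric o) {
            requests += o.requests; clicks += o.clicks; return this;
        }
    }

    /** First-layer aggregation, chained to the split operator so partials never
     *  cross the network. For brevity the buffer is flushed by size only; a
     *  production version would also flush on timers and checkpoints. */
    public static class LocalMerge extends ProcessFunction<ExpMetric, ExpMetric> {
        private static final int FLUSH_SIZE = 10_000;
        private transient Map<Long, ExpMetric> buffer;

        @Override
        public void open(Configuration parameters) {
            buffer = new HashMap<>();
        }

        @Override
        public void processElement(ExpMetric m, Context ctx, Collector<ExpMetric> out) {
            buffer.merge(m.tagId, m, ExpMetric::merge);
            if (buffer.size() >= FLUSH_SIZE) {
                buffer.values().forEach(out::collect);
                buffer.clear();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<AdEvent> events = env.fromElements(new AdEvent()); // stand-in for the Kafka source

        events
            // 1) split: one record per experiment tag; this is where data expands
            .flatMap((AdEvent e, Collector<ExpMetric> out) -> {
                for (long tagId : e.expTagIds) {
                    out.collect(new ExpMetric(tagId, e.requests, e.clicks));
                }
            })
            .returns(ExpMetric.class)
            // 2) local merge inside the same chained thread: shrinks the stream cheaply
            .process(new LocalMerge())
            // 3) global hash window: second-layer aggregation on far fewer records
            .keyBy(m -> m.tagId)
            .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
            .reduce(ExpMetric::merge)
            .print(); // stand-in for the Doris sink

        env.execute("ab-two-layer-aggregation");
    }
}
```

Flushing by buffer size alone keeps the sketch short; a real job would also flush on checkpoints so buffered partials are never lost on failure.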

Given the characteristics of AB experiment analysis, we use the experiment ID as Doris's first sort column, letting the prefix index quickly locate the target data. We also build materialized views on the commonly used dimension combinations to further reduce the amount of data scanned; Doris materialized views cover roughly 80% of our query scenarios, and we regularly analyze query SQL to adjust them. Through this model design, the prefix index, and the materialized view capability, most experiment queries return results in seconds.
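The table and view below are a hedged sketch of this design; all names and dimensions are illustrative assumptions. The experiment ID leads the AGGREGATE KEY, so the prefix index can narrow scans to a single experiment, and the materialized view pre-aggregates one commonly used dimension combination (Doris transparently rewrites matching queries to use it):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExpReport {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://doris-fe:9030/ad_dw?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "admin", "passwd");
             Statement stmt = conn.createStatement()) {
            // exp_id is the first key column, so the prefix index narrows a
            // query on one experiment to a small slice of the table.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS exp_report (" +
                "  exp_id      BIGINT      NOT NULL," +
                "  dt          DATE        NOT NULL," +
                "  dsp         VARCHAR(64) NOT NULL," +
                "  ad_position VARCHAR(64) NOT NULL," +
                "  requests    BIGINT SUM DEFAULT \"0\"," +
                "  clicks      BIGINT SUM DEFAULT \"0\"" +
                ") AGGREGATE KEY(exp_id, dt, dsp, ad_position) " +
                "DISTRIBUTED BY HASH(exp_id) BUCKETS 32");
            // A single-table materialized view for one common dimension combination;
            // queries grouping by (exp_id, dt, dsp) are rewritten to it automatically.
            stmt.execute(
                "CREATE MATERIALIZED VIEW exp_by_dsp AS " +
                "SELECT exp_id, dt, dsp, SUM(requests), SUM(clicks) " +
                "FROM exp_report GROUP BY exp_id, dt, dsp");
        }
    }
}
```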

Data Consistency Guarantee

Data accuracy is the foundation of the AB experiment platform. When the algorithm team has painstakingly optimized a model to lift advertising performance by a few percentage points, but the effect of the experiment cannot be seen because data was lost, the result is simply unacceptable; it is also a problem we do not tolerate internally. So how do we avoid data loss and guarantee data consistency?

The Self-Developed Flink Sink Doris Component

We already had an internal Flink Stream API scaffold. Building on Doris's idempotent-write capability and Flink's two-phase commit, we developed a Sink To Doris component that guarantees end-to-end data consistency, and added a data-protection mechanism on top for exceptional situations.

With Doris 0.14 (the version we started with), we generally ensured consistency through the mechanism that "a given label ID is written at most once." Since Doris 1.0, consistency is ensured through "Doris transactions combined with Flink two-phase commit." Below we share in detail how that mechanism works and how we implemented it after moving to Doris 1.0.

Flink offers two ways to achieve end-to-end consistency: at-least-once delivery combined with idempotent writes, or exactly-once delivery via two-phase transactions.

As shown in the figure, during the write phase we first write the data to a local file; in the first phase we pre-commit the data to Doris and save the transaction ID into Flink state. If the checkpoint fails, the Doris transaction is manually aborted; if the checkpoint succeeds, the transaction is committed in the second phase. For data that still fails after multiple retries of the two-phase commit, we provide an option to save the data and transaction ID to HDFS and recover manually via Broker Load. To avoid a single commit being so large that the Stream Load exceeds the Flink checkpoint interval, we also provide an option to split a single checkpoint into multiple transactions. With this two-phase commit mechanism, data consistency is successfully guaranteed.
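Below is a condensed sketch of this transaction flow using Doris's two-phase Stream Load HTTP interface; the host, credentials, database, table, and payload are illustrative placeholders, and the real Sink To Doris component drives these calls from Flink's checkpoint lifecycle rather than from a main method. The pre-commit is a Stream Load carrying the two_phase_commit header; the commit (or abort) is a second PUT carrying the transaction ID:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpHeaders;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class TwoPhaseStreamLoadDemo {

    static final String AUTH = "Basic " + Base64.getEncoder()
            .encodeToString("loader:passwd".getBytes(StandardCharsets.UTF_8));

    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.custom()
                .setRedirectStrategy(new DefaultRedirectStrategy() {
                    @Override protected boolean isRedirectable(String method) { return true; }
                }).build()) {

            // Phase 1 (maps to Flink's preCommit at the checkpoint barrier):
            // load the data but keep the Doris transaction open.
            HttpPut load = new HttpPut("http://doris-fe:8030/api/ad_dw/exp_report/_stream_load");
            load.setHeader(HttpHeaders.EXPECT, "100-continue");
            load.setHeader(HttpHeaders.AUTHORIZATION, AUTH);
            load.setHeader("label", "exp_" + UUID.randomUUID());
            load.setHeader("two_phase_commit", "true"); // pre-commit only
            load.setHeader("format", "json");
            load.setHeader("strip_outer_array", "true");
            load.setEntity(new StringEntity("[{\"exp_id\":1,\"clicks\":10}]",
                    StandardCharsets.UTF_8));

            long txnId;
            try (CloseableHttpResponse resp = client.execute(load)) {
                String body = EntityUtils.toString(resp.getEntity());
                Matcher m = Pattern.compile("\"TxnId\"\\s*:\\s*(\\d+)").matcher(body);
                if (!m.find()) throw new IllegalStateException("pre-commit failed: " + body);
                txnId = Long.parseLong(m.group(1)); // kept in Flink state in the real sink
            }

            // Phase 2 (maps to Flink's notifyCheckpointComplete): commit the
            // transaction. On checkpoint failure the sink sends txn_operation: abort.
            HttpPut commit = new HttpPut("http://doris-fe:8030/api/ad_dw/_stream_load_2pc");
            commit.setHeader(HttpHeaders.AUTHORIZATION, AUTH);
            commit.setHeader("txn_id", String.valueOf(txnId));
            commit.setHeader("txn_operation", "commit");
            try (CloseableHttpResponse resp = client.execute(commit)) {
                System.out.println(EntityUtils.toString(resp.getEntity()));
            }
        }
    }
}
```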

Application Display

The figure below shows how Sink To Doris is used. The tool shields the API calls and the assembly of the topology streams; writing data into Doris via Stream Load requires only simple configuration.

Cluster Monitoring

For cluster monitoring, we adopted the monitoring templates provided by the community and built the Doris monitoring system around three aspects: cluster metrics, host metrics, and data-processing metrics. Cluster and host metrics follow the community monitoring documentation and let us check the overall state of the cluster. Beyond the community templates, we added monitoring of Stream Load-related metrics, such as the current number of Stream Loads and the volume of data written, as shown in the figure below:
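As a small sketch of how such metrics can be collected, the snippet below scrapes the FE's Prometheus-format /metrics endpoint and filters Stream Load-related series; the host and port are placeholders, and the exact metric names vary across Doris versions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DorisMetricsPeek {
    public static void main(String[] args) throws Exception {
        // Doris FE (and BE) processes expose Prometheus-format metrics over HTTP.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://doris-fe:8030/metrics")).GET().build();
        String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
        body.lines()
            .filter(l -> l.contains("stream_load")) // metric names vary by version
            .forEach(System.out::println);
    }
}
```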

We also pay close attention to the latency and throughput of data written into Doris. Based on our business needs, we monitor data-processing metrics such as task write speed and data-processing time, which helps us spot write and read anomalies promptly, with alerting handled by the company's internal alarm platform (phone, SMS, push, email, and so on).

Summary and Planning

Today, Apache Doris is used mainly in advertising business scenarios, on clusters of dozens of machines covering nearly 70% of our real-time data analysis scenarios, fully supporting the experiment platform and accelerating queries over part of the offline DWS-layer data. The volume of newly added data reaches tens of billions of records per day, and in most real-time scenarios query latency stays within 1 second. The successful adoption of Apache Doris also let us unify the real-time data warehouse on a single OLAP engine, and Doris's excellent analytical performance and ease of use have made the warehouse architecture cleaner.

Going forward, we will expand the Doris cluster, isolate resources by business priority, and improve the resource management mechanism. We plan to apply Doris to a wider range of business scenarios within 360's commercialization and give full play to its advantages in OLAP scenarios. Finally, we will participate more deeply in the Doris community, actively give back, and move forward together with Doris!
