Wonderful SQL | A New Approach to Optimizing Deduplication Cube Calculations

Introduction

SQL is currently the most widely used database query language, and its capabilities go far beyond the familiar "SELECT * FROM tbl". The performance gap between a well-written SQL statement and a poorly written one can be tens or even thousands of times. Writing SQL that is both performant and easy to use takes more than knowing the latest features and syntax tricks; it requires a deep understanding of how the data is actually processed, and then designing a good processing flow around that.

That is why I am launching this series of articles, "Wonderful SQL". Starting from real cases, I hope to share new solutions and ideas for SQL data processing and, along the way, work toward an understanding of the essence of each problem. I hope you enjoy it~

This article, the first in the series, shares our practice of optimizing deduplication cube calculations during the transformation and upgrade of Ant Group's data warehouse.

1. Scenario description

In data aggregation and statistical analysis, the most troublesome work is computing deduplicated metrics (such as the number of users or the number of merchants), especially in multi-dimensional drill-down analysis. Because these metrics are not additive, almost every change in the combination of statistical dimensions forces a recalculation. When the data volume is small, ad-hoc statistics can simply be run directly on the detail data, but when the data volume is large, the results have to be pre-computed.

A typical scenario: the number of daily paying users of the Alipay client broken down by province, city, and district (where province, city, and district are the locations where users make payments, corresponding to the metric dimensions in the tables below).

Consider a user who pays with Alipay in Hangzhou in the morning and then pays offline with Alipay again after traveling to Shaoxing in the afternoon. When counting daily paying users at the province + city level, this user appears under both Hangzhou and Shaoxing, yet after per-user deduplication contributes only 1 to the Zhejiang Province total. In such cases the data is usually pre-computed in the form of a cube, and because the metric is not additive, every dimension combination has to be deduplicated separately. This, roughly, is the deduplication cube scenario discussed in this article.

2. Common implementation methods

The straightforward approach is to compute each dimension combination separately: for example, generate separate tables for province, province + city, province + city + district, and so on, each covering a fixed set of dimensions. The alternative is to expand the data and then re-aggregate it, for example with UNION ALL, LATERAL VIEW EXPLODE, or MaxCompute's CUBE calculation functions: the data is expanded so that one detail row can serve several dimension combinations at once, as shown in the figure below.
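
For concreteness, here is a minimal sketch of the expansion-based approach (not code from the original article), assuming the detail table tmp_user_pay_order_detail used in section 4.3 and an engine that supports GROUPING SETS, such as MaxCompute or Spark SQL; only the province and province + city combinations are shown.

-- Expansion-based cube: internally each detail row is replicated once per matching
-- grouping set, then users are deduplicated separately within every combination.
SELECT  IF(GROUPING(prov_name) = 0, prov_name, 'ALL') AS prov_name
       ,IF(GROUPING(city_name) = 0, city_name, 'ALL') AS city_name
       ,COUNT(DISTINCT user_id)                       AS user_cnt
FROM    tmp_user_pay_order_detail
GROUP BY prov_name
        ,city_name
GROUPING SETS (
        (prov_name)              -- province only
       ,(prov_name, city_name)   -- province + city
);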

These three expansion-based variants (UNION ALL, LATERAL VIEW EXPLODE, and the CUBE function) are essentially similar: the key is to expand the data as shown in the figure and then re-aggregate it. The execution process is shown below. The core idea is to first "blow up" each row into multiple rows and then run an ordinary DISTINCT-style deduplication over them, so there is no significant performance difference between the variants; they differ mainly in code maintainability.

3. Performance analysis

As noted above, all of these expansion-based solutions first "inflate" the data into multiple rows and then deduplicate it in the ordinary way, so their performance is essentially the same and they differ mainly in maintainability. Their computation time grows linearly with the number of required dimension combinations, and on top of that the inherent cost of deduplication itself further drags performance down.
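
To make the linear growth concrete (a back-of-the-envelope bound of ours, not a figure from the article): if the cube contains k dimension combinations (grouping sets), each detail row is replicated at most once per combination, so

$$ N_{\text{expanded}} \le N_{\text{detail}} \times k $$

i.e., the volume of data that has to be materialized and shuffled scales roughly linearly with the number of combinations, before the deduplication work itself even starts.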

In actual experiments we found that during deduplication cube computation, more than 80% of the cost is spent on data expansion and data transmission. For example, in a core-metric scenario that computes the number of paying users across many dimension combinations, an experiment on roughly 10 billion rows of data saw the data expand nearly tenfold: the intermediate result grew from about 100 GB to 1 TB in size, and the row count grew from 10 billion to nearly 130 billion. Most of the computing resources and run time were spent on expanding and shuffling data. If the number of dimension combinations increases further, the expansion grows further with it.

4. A new idea

First, let's break the problem down. The deduplication cube computation has two core parts: data expansion and data deduplication. Expansion makes a single detail row satisfy several dimension combinations at once, while deduplication produces the final distinct counts. The root of the cost is that the original data has to be matched against every result combination: deduplication is already expensive, and expansion makes it worse, because a large amount of data must be split up and shipped around during the shuffle. In the usual flow the data is expanded first and aggregated later, and since each row carries fairly long Chinese and English string values, expansion multiplies a lot of bytes, driving up both computation and transmission costs.

Our core idea is to avoid data expansion and, at the same time, further shrink the amount of data that has to be transferred. We therefore considered a data labeling scheme, similar to user tagging: first deduplicate the data down to UID granularity to produce intermediate data, and attach the required result dimension combinations back onto this UID-grained data, recording each combination as a numeric code held in a compact data structure, so that large amounts of data never need to be shipped during the computation. Throughout the whole process the data volume, in theory, only converges; it does not grow as the number of statistical dimension combinations increases.

4.1. Core idea

The core calculation idea is shown above. The ordinary expansion-based cube requires expanding the data and re-aggregating it, and the number of dimension combinations needed for the result is the expansion multiple: with the two combinations above, province and province + city, the data is expected to double.

The new aggregation method instead uses a strategy that splits the dimension combinations into small, numbered dimension tables, and then aggregates the order-level detail into user-grained intermediate data in which each required dimension combination is recorded on the user's record as a numeric label. Throughout the whole computation the data volume shrinks as it is aggregated and never expands.

4.2. Logic implementation

  • Detail data preparation: taking users' offline payment data as the example, each detail record contains the order number, user ID, payment date, province, city, and payment amount. The requirement is to compute a multi-dimensional cube of the number of paying users for the combined dimensions province and province + city.
Order number | User ID | Payment date | Province | City | Payment amount
2023111101 | U001 | 2023-11-11 | Zhejiang Province | Hangzhou City | 1.11
2023111102 | U001 | 2023-11-11 | Zhejiang Province | Shaoxing City | 2.22
2023111103 | U002 | 2023-11-11 | Zhejiang Province | Hangzhou City | 3.33
2023111104 | U003 | 2023-11-11 | Jiangsu Province | Nanjing City | 4.44
2023111105 | U003 | 2023-11-11 | Zhejiang Province | Wenzhou City | 5.55
2023111106 | U004 | 2023-11-11 | Jiangsu Province | Nanjing City | 6.66

The overall program flow is as follows.

  • STEP1: Extract the required dimension values from the detail data (i.e., GROUP BY the corresponding fields) to obtain the dimension set.

  • STEP2: Generate the cube from the dimension set and encode its rows (assuming the two required combinations are province and province + city). This can be done with the CUBE functions of ODPS; a unique code is then generated for each cube row, for example by sorting the generated dimension combinations.
Original dimension: province | Original dimension: city | Cube dimension: province | Cube dimension: city | Cube row ID (can be generated by sorting)
Zhejiang Province | Hangzhou City | Zhejiang Province | all | 1
Zhejiang Province | Hangzhou City | Zhejiang Province | Hangzhou City | 2
Zhejiang Province | Shaoxing City | Zhejiang Province | all | 1
Zhejiang Province | Shaoxing City | Zhejiang Province | Shaoxing City | 3
Zhejiang Province | Wenzhou City | Zhejiang Province | all | 1
Zhejiang Province | Wenzhou City | Zhejiang Province | Wenzhou City | 4
Jiangsu Province | Nanjing City | Jiangsu Province | all | 5
Jiangsu Province | Nanjing City | Jiangsu Province | Nanjing City | 6
  • STEP3: Write the cube-row codes back onto the detail records via the mapping relationship; this can be implemented with a MAPJOIN.
Order number | User ID | Payment date | Province | City | Cube ID set
2023111101 | U001 | 2023-11-11 | Zhejiang Province | Hangzhou City | [1,2]
2023111102 | U001 | 2023-11-11 | Zhejiang Province | Shaoxing City | [1,3]
2023111103 | U002 | 2023-11-11 | Zhejiang Province | Hangzhou City | [1,2]
2023111104 | U003 | 2023-11-11 | Jiangsu Province | Nanjing City | [5,6]
2023111105 | U003 | 2023-11-11 | Zhejiang Province | Wenzhou City | [1,4]
2023111106 | U004 | 2023-11-11 | Jiangsu Province | Nanjing City | [5,6]
  • STEP4: Aggregate to user granularity and deduplicate the cube ID set field (ARRAY_DISTINCT can be used).

  • STEP5: Count users per cube ID (since STEP4 has already deduplicated, no further deduplication is needed here), then translate the cube IDs back to dimensions via the mapping relationship.
Cube ID | Number of paying users | Restored cube dimension: province | Restored cube dimension: city
1 | 3 | Zhejiang Province | all
2 | 2 | Zhejiang Province | Hangzhou City
3 | 1 | Zhejiang Province | Shaoxing City
4 | 1 | Zhejiang Province | Wenzhou City
5 | 2 | Jiangsu Province | all
6 | 2 | Jiangsu Province | Nanjing City
  • End~

4.3. Code implementation

WITH 
-- Prepare the basic detail data table
base_dwd AS (
  SELECT   pay_no
          ,user_id
          ,gmt_pay
          ,pay_amt
          ,prov_name
          ,prov_code
          ,city_name
          ,city_code
  FROM tmp_user_pay_order_detail
)
-- Generate the multi-dimensional cube and encode each cube row
,dim_cube AS (
  -- Step02: cube generation
  SELECT *,DENSE_RANK() OVER(PARTITION BY 1 ORDER BY cube_prov_name,cube_city_name) AS cube_id 
  FROM (
    SELECT dim_key
          ,COALESCE(IF(GROUPING(prov_name) = 0,prov_name,'ALL'),'na') AS cube_prov_name             
          ,COALESCE(IF(GROUPING(city_name) = 0,city_name,'ALL'),'na') AS cube_city_name 
    FROM (
      -- Step01: collect the distinct dimension values
        SELECT  CONCAT(''
                       ,COALESCE(prov_name ,''),'#' 
                       ,COALESCE(city_name     ,''),'#' 
                ) AS dim_key
                ,prov_name
                ,city_name
        FROM base_dwd
        GROUP BY prov_name
        ,city_name
    ) base
    GROUP BY dim_key
            ,prov_name
              ,city_name
    GROUPING SETS (
           (dim_key,prov_name)
          ,(dim_key,prov_name,city_name)
    )
  )
)
-- Write the cube IDs back onto the detail records and build UID-grained intermediate data
,detail_ext AS (
  -- Step04: aggregate to user granularity (metric statistics)
  SELECT   user_id
          ,ARRAY_DISTINCT(SPLIT(WM_CONCAT(';',cube_ids),';')) AS cube_id_arry
  FROM (
    -- Step03: write the cube IDs back to the detail records
    SELECT  /*+ MAPJOIN(dim_cube) */ 
          user_id
        ,cube_ids
    FROM (
        SELECT   user_id
                ,CONCAT(''
                       ,COALESCE(prov_name,''),'#' 
                      ,COALESCE(city_name,''),'#' 
                 ) AS dim_key
        FROM base_dwd
    ) dwd_detail
    JOIN (
        SELECT dim_key,WM_CONCAT(';',cube_id) AS cube_ids
        FROM dim_cube 
        GROUP BY dim_key
    ) dim_cube
    ON dwd_detail.dim_key = dim_cube.dim_key
  ) base
  GROUP BY user_id
)
-- Aggregate the metric and translate cube IDs back into readable dimensions
,base_dws AS (
  -- Step05: cube ID translation
  SELECT cube_id
        ,MAX(prov_name) AS prov_name
        ,MAX(city_name    ) AS city_name
        ,MAX(uid_cnt      ) AS user_cnt
  FROM (
      SELECT cube_id              AS cube_id
            ,COUNT(1)             AS uid_cnt
            ,CAST(NULL AS STRING) AS prov_name
            ,CAST(NULL AS STRING) AS city_name
      FROM detail_ext
      LATERAL VIEW EXPLODE(cube_id_arry) arr AS cube_id
      GROUP BY cube_id
      UNION ALL 
      SELECT CAST(cube_id AS STRING) AS cube_id
            ,CAST(NULL AS BIGINT) AS uid_cnt
            ,cube_prov_name       AS prov_name
            ,cube_city_name       AS city_name    
      FROM dim_cube
  ) base 
  GROUP BY cube_id
)
-- All done, output the result!
SELECT   prov_name
        ,city_name
        ,user_cnt
FROM base_dws
;
  • The actual execution process (the ODPS Logview) is shown below.

4.4. Experimental results

On the left is the new pipeline based on the cube-labeling solution. In the experiment the data volume was increased from 10 billion to 20 billion rows and the number of dimension combinations from the original 25 to 50, and the whole job finished in about 18 minutes. With the same data volume and dimension combinations as the original workload, the run time can be kept within 10 minutes.

On the right is the old pipeline based on data expansion. With 10 billion rows and 25 dimension combinations, the intermediate data expands to 130 billion+ rows and 1 TB+ in size, and the whole job takes about 47 minutes. If this solution were scaled up to the new method's 20 billion rows x 50 dimension combinations, the intermediate data would expand to 400 billion+ rows and 3 TB+ in size, and the overall run time would reach 2.5 hours+.

The new method has now been launched for core business lines. Even though the number of statistical dimension combinations and the computation volume have grown significantly, the core metrics are now produced more than an hour earlier, which further helps to guarantee the stability of the core metric data.

4.5. Summary of the solution

With the common expansion-based cube computation, the amount of data computed and shuffled grows linearly with the number of dimension combinations: the more combinations there are, the more resources are consumed by data expansion and shuffle transfer. In our experiment with 10 billion rows and 25 dimension combinations, the intermediate data expanded to 130 billion+ rows and grew from 100 GB to 1 TB in size. When the data volume and the number of dimension combinations increase further, the computation becomes essentially infeasible.

To solve the problem of the huge amount of intermediate data produced by expansion, we inverted the process based on the idea of data labeling: the data is first aggregated to UID granularity, and along the way the required dimension combinations are converted into numeric codes attached to the detail data. The data volume converges throughout the computation and stays stable instead of exploding as more dimension combinations are added. In the experiment, the data volume was increased from 10 billion to 20 billion+ rows and the dimension combinations from the original 25 to 50, yet the whole job still finished in about 18 minutes. With the same data, the old expansion-based scheme would blow the intermediate data up to 400 billion+ rows and 3 TB+ in size, with an overall run time of 2.5 hours+.

To sum up, the current solution clearly outperforms the previous one, and its cost does not grow significantly as the number of dimension combinations increases. It does have drawbacks, however, mainly the readability and maintainability of the code: the labeling computation is not a fixed, ready-made pattern, it takes some initial setup and a learning curve, and it is not yet as easy to read and write as the ordinary UNION ALL / CUBE solutions. In addition, when the number of dimension combinations is small (i.e., the expansion multiple is low), the performance difference between the two is minor, and the ordinary cube solution is recommended; but once the number of combinations reaches the dozens, this labeling approach is worth adopting, because that is when its performance advantage becomes pronounced, and the more dimension combinations there are, the greater the advantage.

5. Other options

The BitMap solution. The core idea is to make a non-additive deduplicated metric behave like an additive one by storing it in a mergeable data structure: user IDs are encoded and kept in a bitmap, where a single binary bit indicates whether a given user is present, costing about 1 bit per user. When rolling dimensions up, the bitmaps are merged and the set bits are counted.
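
A minimal sketch of this idea (not from the original article), assuming hypothetical bitmap aggregate functions RB_BUILD_AGG, RB_OR_AGG, and RB_CARDINALITY in the style of RoaringBitmap extensions, and assuming the user IDs have already been encoded to integers as user_int_id.

-- Pre-aggregate once: one bitmap of (integer-encoded) user IDs per province + city.
CREATE TABLE pay_user_bitmap AS
SELECT  prov_name
       ,city_name
       ,RB_BUILD_AGG(user_int_id) AS user_bm   -- hypothetical: collect user IDs into a bitmap
FROM    tmp_user_pay_order_detail
GROUP BY prov_name
        ,city_name;

-- Roll up to province: OR-merge the city-level bitmaps and count the set bits,
-- which deduplicates users across cities without re-reading the detail data.
SELECT  prov_name
       ,RB_CARDINALITY(RB_OR_AGG(user_bm)) AS user_cnt   -- hypothetical: merge, then count
FROM    pay_user_bitmap
GROUP BY prov_name;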

The HyperLogLog solution. Compared with exact deduplication via DISTINCT, this approximate deduplication improves performance significantly.
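
A minimal sketch (not from the original article): the aggregate below is Spark SQL's approx_count_distinct, which uses HyperLogLog under the hood; the function name and availability vary by engine.

-- Approximate deduplicated user counts for both dimension combinations in one pass;
-- the counts carry a small relative error but avoid exact global deduplication.
SELECT  IF(GROUPING(prov_name) = 0, prov_name, 'ALL') AS prov_name
       ,IF(GROUPING(city_name) = 0, city_name, 'ALL') AS city_name
       ,APPROX_COUNT_DISTINCT(user_id)                AS user_cnt_approx
FROM    tmp_user_pay_order_detail
GROUP BY prov_name
        ,city_name
GROUPING SETS (
        (prov_name)
       ,(prov_name, city_name)
);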

Both of these solutions perform far better than the ordinary cube computation. However, the BitMap solution requires encoding and storing the UIDs used for deduplication; unless that capability is integrated at the platform level, it is relatively costly for ordinary users to understand and implement and usually requires extra code development. The main drawback of the HyperLogLog solution is that its counts are not exact.

Author|Jiaer

Original link

This article is original content from Alibaba Cloud and may not be reproduced without permission.

Origin my.oschina.net/yunqi/blog/10560406