Data Warehouse Practice丨Performance bottleneck caused by too many filtered rows during table scan

This article is shared from the Huawei Cloud Community post "GaussDB (DWS) Performance Tuning: A Case Study of Performance Bottleneck Problems Caused by Excessive Number of Filtered Rows During Table Scans" by O Paoguolai~.

1. [Problem Description]

While the SQL statement runs, it scans a large table of 1.2 billion rows and filters out more than 99% of them, leaving only 617 rows. The performance bottleneck is the table scan.

2. [Original Statement]

set search_path = 'bi_dashboard'; 

WITH F_SRV_DB_DIM_PRD_D AS 
 ( 
  SELECT EXTERNAL_NAME 
    FROM (SELECT MKT_NAME EXTERNAL_NAME 
            FROM BI_DASHBOARD.DM_MSS_ITEM_PRODUCT_D PRD 
           WHERE PRD.COMPANY_BRAND = any(array[string_to_array('HUAWEI',',')]) 
             AND PRD.MKT_NAME = any(array[string_to_array('Enjoy 60, Enjoy 50, Enjoy 60X, Enjoy 60 Pro, Enjoy 50 Pro, Enjoy 50z, nova 10z, Enjoy 20e, Enjoy 20 Pro, Enjoy 10e, Enjoy 10 Plus, Enjoy 20 SE, Enjoy 10, nova 11i, Enjoy 20 Plus, Enjoy 9 Plus, Enjoy 20 5G, nova Y90, Enjoy 10S, nova Y70, Enjoy Z, Enjoy 9S, nova 8 SE Active Edition, Maimang 9 5G, Y9s, Maimang 9 5G',',')])) 
   WHERE EXTERNAL_NAME <> 'SNULL' 
   GROUP BY EXTERNAL_NAME 
 ), 

V_PERIOD AS 
 ( 
  SELECT PERIOD_ID AS PERIOD_ID_M, 
         LEAST(TO_CHAR(PERIOD_END_DATE, 'YYYYMMDD'), '20230630') AS PERIOD_ID, 
         PERIOD_ID AS DATES 
    FROM BI_DASHBOARD.RPT_TML_ACCOUNT_PERIOD_D 
   WHERE PERIOD_TYPE = 'M' 
     AND PERIOD_ID BETWEEN 202207 AND 202306 
 ), 
 
V_DATA_BASE AS 
 ( 
  SELECT A.PERIOD_ID, 
         IFNULL(A.CHANNEL_NAME, 'SNULL') AS DISTRIBUTOR_CHANNEL_NAME, 
         SUM(A.SO_QTY_MTD) AS SO_QTY, 
         SUM(DECODE(A.PERIOD_ID, 20230630, A.SO_QTY_MTD)) AS SO_QTY_ORDER 
    FROM DM_MSS_CN_PC_REP_RP_ST_D_F A 
   INNER JOIN F_SRV_DB_DIM_PRD_D PRD 
      ON A.EXTERNAL_NAME = PRD.EXTERNAL_NAME 
   WHERE 1 = 1 
     AND A.CHANNEL_ID IN ('100013388802') 
     AND A.ORG_KEY IN (10000651) 
    
     AND A.SALES_FLAG IN ('1', '0') 
     AND A.PERIOD_ID IN (20220731,20221031,20220930,20220831,20221130, 20221231,20230131,20230228,20230430,20230331,20230531,20230630) 
     AND (A.SO_QTY_MTD <> 0) -- Filter all data whose date SO_QTY is 0 
   GROUP BY A.PERIOD_ID, 
            IFNULL(A.CHANNEL_NAME, 'SNULL' ) 
 ) , 
 
V_DATA AS 
 ( 
  SELECT PERIOD_ID, 
         NVL(DISTRIBUTOR_CHANNEL_NAME, 'Total') AS DISTRIBUTOR_CHANNEL_NAME, 
         SUM(SO_QTY) AS SO_QTY, 
         SUM(SO_QTY_ORDER) AS SO_QTY_ORDER 
    FROM V_DATA_BASE A 
   GROUP BY GROUPING SETS ((PERIOD_ID), (PERIOD_ID, DISTRIBUTOR_CHANNEL_NAME)) 
 ) 

  SELECT STRING_AGG(P.DATES, ',' ORDER BY P.PERIOD_ID_M) AS PERIOD_LIST, 
         B.DISTRIBUTOR_CHANNEL_NAME, 
         STRING_AGG(NVL(TO_CHAR(ROUND(A.SO_QTY)), '0'), ',' ORDER BY P.PERIOD_ID_M ) AS SO_QTY 
    FROM V_PERIOD P 
    FULL JOIN (SELECT DISTINCT DISTRIBUTOR_CHANNEL_NAME FROM V_DATA) B 
      ON 1 = 1 
    LEFT JOIN V_DATA A 
      ON A.PERIOD_ID = P.PERIOD_ID
     AND A.DISTRIBUTOR_CHANNEL_NAME = B.DISTRIBUTOR_CHANNEL_NAME
   GROUP BY B.DISTRIBUTOR_CHANNEL_NAME
   ORDER BY DECODE(B.DISTRIBUTOR_CHANNEL_NAME, 'Total', 0, 'SOURCE IS NULL', 2, '源为空', 3, 'SNULL', 4,  1), 
            SUM(A.SO_QTY_ORDER) DESC NULLS LAST
   LIMIT 50 OFFSET 0

3. [Performance Analysis]

[Figure: performance execution plan of the original statement]
As the performance execution plan in the figure above shows (the complete execution plan is in Appendix 1), the SQL statement is slow when scanning table A (bi_dashboard.dm_mss_cn_pc_rep_rp_st_d_f_test). The scan filters on sales_flag, so_qty_mtd, channel_id, org_key, and period_id. The table's original partial cluster key (PCK) contains only period_id and none of the other filter columns. The PCK can therefore be adjusted to reduce the execution time of scanning table A.

Supplement: partial cluster key (PCK), also translated as local clustering key

A Partial Cluster Key (PCK) is an indexing technique for column-store tables that uses min/max sparse indexes to speed up base table scans. A PCK can specify multiple columns, but using more than 2 columns is generally not recommended. PCK is well suited to accelerating point queries on large column-store tables.
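To make the min/max idea concrete, here is a minimal Python sketch of block skipping. It is not how DWS is implemented: the `Block` class, the block size, and the toy data are all hypothetical, and the sketch sorts the whole data set, whereas a real PCK only clusters data locally at load time. It only illustrates why sorting on the filter column lets a scan skip blocks whose min/max range cannot contain the value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One storage block (hypothetical stand-in for a column-store unit)."""
    values: List[int]

    @property
    def min_max(self) -> Tuple[int, int]:
        return min(self.values), max(self.values)


def build_blocks(values: List[int], block_size: int, clustered: bool) -> List[Block]:
    """Split values into fixed-size blocks; sort first when clustering is on."""
    data = sorted(values) if clustered else list(values)
    return [Block(data[i:i + block_size]) for i in range(0, len(data), block_size)]


def point_query(blocks: List[Block], target: int) -> Tuple[int, int]:
    """Return (matching rows, blocks actually scanned) for `col = target`."""
    matches = 0
    scanned = 0
    for block in blocks:
        lo, hi = block.min_max
        if target < lo or target > hi:
            continue  # min/max proves the block cannot contain the target
        scanned += 1
        matches += sum(1 for v in block.values if v == target)
    return matches, scanned


# Toy data: 1000 rows whose key cycles through 0..99, so equal keys are scattered.
rows = [i % 100 for i in range(1000)]
unclustered = build_blocks(rows, block_size=100, clustered=False)
clustered = build_blocks(rows, block_size=100, clustered=True)

print(point_query(unclustered, 42))  # every block spans 0..99, none can be skipped
print(point_query(clustered, 42))    # only the block covering 40..49 is scanned
```

In this toy setup both layouts return the same 10 matching rows, but the unclustered layout scans all 10 blocks while the clustered one scans a single block. A filter column that is not part of the cluster key behaves like the unclustered case, which is the situation described for table A above.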

In addition, the WHERE clause of the statement contains an IN list with many values (12) on PERIOD_ID. In DWS, by default an IN condition is pushed down only when its list has at most 5 values; with more values, the filter is not pushed down. In that case, the 12 values can be split into shorter IN lists combined with OR:

-- Parenthesized so the OR group stays a single condition among the surrounding ANDs
(A.PERIOD_ID IN (20220731,20221031,20220930,20220831,20221130)
 OR A.PERIOD_ID IN (20221231,20230131,20230228,20230430,20230331)
 OR A.PERIOD_ID IN (20230531,20230630))
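When the statement is generated by an application, this kind of split can be done mechanically. The following Python sketch shows one way to do it; `split_in_list` is a hypothetical helper, not part of DWS or any driver, and the default of 5 values per list simply follows the limit described above. It handles integer values only and is not meant for untrusted input.

```python
from typing import List, Sequence


def split_in_list(column: str, values: Sequence[int], max_items: int = 5) -> str:
    """Build `(col IN (...) OR col IN (...))` with at most `max_items` values per list."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    if not values:
        raise ValueError("values must not be empty")
    chunks: List[str] = []
    for start in range(0, len(values), max_items):
        # Integers only: string values would need proper quoting and escaping.
        items = ",".join(str(int(v)) for v in values[start:start + max_items])
        chunks.append(f"{column} IN ({items})")
    # Wrap in parentheses so the OR group is safe inside an AND chain.
    return "(" + "\n OR ".join(chunks) + ")"


periods = [20220731, 20221031, 20220930, 20220831, 20221130,
           20221231, 20230131, 20230228, 20230430, 20230331,
           20230531, 20230630]
predicate = split_in_list("A.PERIOD_ID", periods)
print(predicate)
```

For the 12 period values of this case, the helper produces three IN lists of 5, 5, and 2 values, joined by OR and wrapped in one pair of parentheses.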


After these changes, the execution time of the SQL statement drops to 487 ms. The complete performance plan is shown in Appendix 2.



Origin my.oschina.net/u/4526289/blog/10141543