[SQL Development Practical Skills] Series (28): Data Warehouse Report Scenario ☞ Personnel distribution and how to achieve simultaneous aggregation of different groups (partitions)

Series Article Directory

[SQL Development Practical Skills] Series (1): Those things that have to be said about SQL
[SQL Development Practical Skills] Series (2): Simple single-table queries
[SQL Development Practical Skills] Series (3): Those things about SQL sorting
[SQL Development Practical Skills] Series (4): Precautions for using UNION ALL with empty strings, and UNION vs. OR, seen from the execution plan
[SQL Development Practical Skills] Series (5): The efficiency of IN, EXISTS and INNER JOIN seen from the execution plan — divide by scenario instead of memorizing online conclusions
[SQL Development Practical Skills] Series (6): The efficiency of NOT IN, NOT EXISTS and LEFT JOIN seen from the execution plan, and remember not to misplace inner and outer join conditions
[SQL Development Practical Skills] Series (7): How to compare the differing data and corresponding record counts of two tables when duplicate data exists
[SQL Development Practical Skills] Series (8): How to insert data, restricting inserts more flexibly than constraints, and how one INSERT statement can insert into multiple tables at the same time
[SQL Development Practical Skills] Series (9): An UPDATE accidentally sets other columns to empty? Rewrite the UPDATE with MERGE! Plus five ways to delete duplicate data
[SQL Development Practical Skills] Series (10): Splitting strings, replacing strings, and counting occurrences of strings
[SQL Development Practical Skills] Series (11): A few cases covering the commonly used functions translate | regexp_replace | listagg | wmsys.wm_concat | substr | regexp_substr
[SQL Development Practical Skills] Series (12): Three questions (how to sort a string's letters alphabetically after deduplication? How to identify which strings contain numbers? How to convert delimited data into a multi-valued IN list?)
[SQL Development Practical Skills] Series (13): Common aggregate functions & accumulating employee wages with sum() over(), seen through the execution plan
[SQL Development Practical Skills] Series (14): Calculating the balance after consumption, cumulative sums of bank transactions, and the top three salaries in each department
[SQL Development Practical Skills] Series (15): Finding the rows holding the extreme values and quick sums with max/min() keep() over(), first_value, last_value, ratio_to_report
[SQL Development Practical Skills] Series (16): Time type operations in the data warehouse (basic): differences in days, months, years, hours, minutes and seconds, and time interval calculations
[SQL Development Practical Skills] Series (17): Time type operations in the data warehouse (basic): determining the number of working days between two dates, counting the occurrences of each weekday in a year, and finding the number of days between the current record and the next record
[SQL Development Practical Skills] Series (18): Time type operations in the data warehouse (advanced): INTERVAL, EXTRACT, how to determine whether a year is a leap year, and week calculations
[SQL Development Practical Skills] Series (19): Time type operations in the data warehouse (advanced): how to print the calendar of the current month or year with one SQL, and how to determine the first and last day of each week in a month
[SQL Development Practical Skills] Series (20): Time type operations in the data warehouse (advanced): obtaining quarter start and end times and how to count discontinuous time data
[SQL Development Practical Skills] Series (21): Time type operations in the data warehouse (advanced): identifying overlapping date ranges and summarizing data at specified 10-minute intervals
[SQL Development Practical Skills] Series (22): Data warehouse report scenario ☞ Are analytic functions necessarily fast? A chat about implementing result set paging and interlaced sampling
[SQL Development Practical Skills] Series (23): Data warehouse report scenario ☞ How to deduplicate data permutations and find the records containing the maximum and minimum values? Using the execution plan again to show that analytic function performance is not necessarily high
[SQL Development Practical Skills] Series (24): Data warehouse report scenario ☞ A detailed explanation of "rows to columns" and "columns to rows" through cases and execution plans
[SQL Development Practical Skills] Series (25): Data warehouse report scenario ☞ Displaying duplicate rows in the result set only once, an efficient way to calculate department salary differences, and how to quickly group data
[SQL Development Practical Skills] Series (26): Data warehouse report scenario ☞ How ROLLUP and UNION ALL each perform group totals and how to identify which rows are summary rows
[SQL Development Practical Skills] Series (27): Data warehouse report scenario ☞ Explaining the windowing principle of analytic functions in detail through aggregation over moving ranges, and how to print the nine-nines multiplication table with one SQL
[SQL Development Practical Skills] Series (28): Data warehouse report scenario ☞ Personnel distribution and how to achieve simultaneous aggregation of different groups (partitions)



Foreword

The main content of this article: using row-to-column conversion to show the distribution of personnel across jobs (each job displayed as a column, each employee as a row), the issues to watch out for when chaining row-to-column conversions, and how to aggregate over different groups (partitions) at the same time, verified through the execution plan — the requirement being to list the employee counts per department and per job alongside the detail rows of the employee table.
I write the [SQL Development Practical Skills] series as a review of old knowledge. After all, SQL development is important and fundamental in data analysis scenarios, and interviews often ask about SQL development and tuning experience. I believe that by the time I finish this series I will have gained something myself, and you will be able to face SQL interviews with ease~


1. Distribution of personnel across jobs

Now there is a requirement: each job should be displayed as a column and each employee as a row; when an employee holds that job, the cell shows '是' (yes), otherwise it is left empty.
How do we meet this requirement?
We can use the PIVOT clause, grouping by employee and spreading the jobs into columns, and fill each matching position with '是':

SQL> select * from (select ename,job from emp)
  2  pivot(
  3  max('是')
  4  for job in(
  5    'ANALYST' as ANALYST,
  6    'CLERK' as CLERK,
  7    'MANAGER' as MANAGER,
  8    'PRESIDENT' as PRESIDENT,
  9    'SALESMAN' as SALESMAN
 10    )
 11  );

ENAME      ANALYST CLERK MANAGER PRESIDENT SALESMAN
---------- ------- ----- ------- --------- --------
ADAMS              是                      
ALLEN                                      是
BLAKE                    是                
CLARK                    是                
FORD       是                              
JAMES              是                      
JONES                    是                
KING                             是        
MARTIN                                     是
MILLER             是                      
SCOTT      是                              
SMITH              是                      
TURNER                                     是
WARD                                       是

14 rows selected

This statement is equivalent to grouping by ename and job: the PIVOT groups by the remaining column ename and spreads the job values into columns.
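If you prefer to see that grouping written out by hand, here is a minimal equivalent sketch (assuming the standard SCOTT-schema EMP demo table used throughout this article) that uses conditional MAX instead of PIVOT:

-- Hand-written equivalent of the PIVOT query above (a sketch against the EMP demo table)
select ename,
       max(case when job = 'ANALYST'   then '是' end) as analyst,
       max(case when job = 'CLERK'     then '是' end) as clerk,
       max(case when job = 'MANAGER'   then '是' end) as manager,
       max(case when job = 'PRESIDENT' then '是' end) as president,
       max(case when job = 'SALESMAN'  then '是' end) as salesman
  from emp
 group by ename
 order by ename;

Each CASE expression returns '是' only for the employee's own job, and MAX collapses each ename group to a single row, which is exactly what the PIVOT does.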

2. Create a sparse matrix

Let's increase the difficulty of the above problem. The requirement now is: show the employee's name directly in the matching position, and add the distribution across departments as well. Because the data is not being summarized, PIVOT can still handle it. The query is as follows:

SQL> 
SQL> select *
  2    from (select empno, ename, ename as ename2, job, deptno from emp)
  3  pivot(max(ename)
  4     for deptno in(10 as d10, 20 as d20, 30 as d30))
  5  pivot(max(ename2)
  6     for job in('ANALYST' as ANALYST,
  7                'CLERK' as CLERK,
  8                'MANAGER' as MANAGER,
  9                'PRESIDENT' as PRESIDENT,
 10                'SALESMAN' as SALESMAN
 11                ));

EMPNO D10        D20        D30        ANALYST    CLERK      MANAGER    PRESIDENT  SALESMAN
----- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
 7900                       JAMES                 JAMES                            
 7369            SMITH                            SMITH                            
 7499                       ALLEN                                                  ALLEN
 7521                       WARD                                                   WARD
 7566            JONES                                       JONES                 
 7654                       MARTIN                                                 MARTIN
 7698                       BLAKE                            BLAKE                 
 7782 CLARK                                                  CLARK                 
 7788            SCOTT                 SCOTT                                       
 7839 KING                                                              KING       
 7844                       TURNER                                                 TURNER
 7876            ADAMS                            ADAMS                            
 7902            FORD                  FORD                                        
 7934 MILLER                                      MILLER                           

14 rows selected

Note: if the data needs to be summarized, do not use this double-PIVOT approach, because the query is actually equivalent to nesting one PIVOT clause inside the other.
The previous article had a count + case when statement, as follows:

SQL> 
SQL> select count(case
  2                 when deptno = 10 then
  3                  ename
  4               end) as deptno_10,
  5         count(case
  6                 when deptno = 20 then
  7                  ename
  8               end) as deptno_20,
  9         count(case
 10                 when deptno = 30 then
 11                  ename
 12               end) as deptno_30,
 13         count(case
 14                 when job = 'ANALYST' then
 15                  job
 16               end) as ANALYST,
 17         count(case
 18                 when job = 'CLERK' then
 19                  job
 20               end) as CLERK,
 21         count(case
 22                 when job = 'MANAGER' then
 23                  job
 24               end) as MANAGER,
 25         count(case
 26                 when job = 'PRESIDENT' then
 27                  job
 28               end) as PRESIDENT,
 29         count(case
 30                 when job = 'SALESMAN' then
 31                  job
 32               end) as SALESMAN
 33    from emp;

 DEPTNO_10  DEPTNO_20  DEPTNO_30    ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         3          5          6          2          4          3          1          4

Let's try to rewrite it with PIVOT and see what happens. The PIVOT version of the statement is as follows:

SQL> 
SQL>  select *
  2     from (select  ename, ename as ename2, job, deptno from emp)
  3   pivot(count(ename)
  4      for deptno in(10 as d10, 20 as d20, 30 as d30))
  5   pivot(count(ename2)
  6      for job in('ANALYST' as ANALYST,
  7                 'CLERK' as CLERK,
  8                 'MANAGER' as MANAGER,
  9                 'PRESIDENT' as PRESIDENT,
 10                 'SALESMAN' as SALESMAN
 11                 ));

       D10        D20        D30    ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         0          0          1          0          1          1          0          4
         0          1          0          2          2          1          0          0
         1          0          0          0          1          1          1          0

SQL> 

As you can see, the result is inconsistent with the case when version. Below we break the query into its nested steps for analysis.
Nesting, first step:

SQL> with t as (
  2  select *
  3     from (select  ename, ename as ename2, job, deptno from emp)
  4   pivot(count(ename)
  5      for deptno in(10 as d10, 20 as d20, 30 as d30))
  6  )
  7  select * from t;

ENAME2     JOB              D10        D20        D30
---------- --------- ---------- ---------- ----------
FORD       ANALYST            0          1          0
KING       PRESIDENT          1          0          0
WARD       SALESMAN           0          0          1
ADAMS      CLERK              0          1          0
ALLEN      SALESMAN           0          0          1
BLAKE      MANAGER            0          0          1
CLARK      MANAGER            1          0          0
JAMES      CLERK              0          0          1
JONES      MANAGER            0          1          0
SCOTT      ANALYST            0          1          0
SMITH      CLERK              0          1          0
MARTIN     SALESMAN           0          0          1
MILLER     CLERK              1          0          0
TURNER     SALESMAN           0          0          1

14 rows selected

The first step is equivalent to group by ename2, job.
Nesting, second step:

SQL> with t as
  2   (select *
  3      from (select ename, ename as ename2, job, deptno from emp)
  4    pivot(count(ename)
  5       for deptno in(10 as d10, 20 as d20, 30 as d30)))
  6  select *
  7    from t
  8  pivot (count(ename2) for job in('ANALYST' as ANALYST,
  9                             'CLERK' as CLERK,
 10                             'MANAGER' as MANAGER,
 11                             'PRESIDENT' as PRESIDENT,
 12                             'SALESMAN' as SALESMAN));

       D10        D20        D30    ANALYST      CLERK    MANAGER  PRESIDENT   SALESMAN
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         0          0          1          0          1          1          0          4
         0          1          0          2          2          1          0          0
         1          0          0          0          1          1          1          0

SQL> 

Because the columns returned by the first step are (ENAME2, JOB, D10, D20, D30), once (ENAME2, JOB) are consumed by the second PIVOT, what remains is (D10, D20, D30). So the second step is equivalent to group by D10, D20, D30.
But what we want are the counts of the emp table grouped by job and, separately, grouped by department — two independent groupings of the same emp rows — not a further grouped count on top of the result of step 1.
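To make the mismatch concrete, here is a hand-written sketch of what the nested double PIVOT actually computes (again against the EMP demo table): the step-1 result is re-grouped by D10, D20, D30, which is why the output has three rows, one per department pattern, instead of a single row of totals.

-- A sketch of the nested double PIVOT written out by hand:
-- step 1 groups by (ename2, job); step 2 re-groups that result by (d10, d20, d30).
with t as
 (select ename as ename2,
         job,
         count(case when deptno = 10 then ename end) as d10,
         count(case when deptno = 20 then ename end) as d20,
         count(case when deptno = 30 then ename end) as d30
    from emp
   group by ename, job)
select d10, d20, d30,
       count(case when job = 'ANALYST'   then ename2 end) as analyst,
       count(case when job = 'CLERK'     then ename2 end) as clerk,
       count(case when job = 'MANAGER'   then ename2 end) as manager,
       count(case when job = 'PRESIDENT' then ename2 end) as president,
       count(case when job = 'SALESMAN'  then ename2 end) as salesman
  from t
 group by d10, d20, d30;

If what you actually want is one row of totals per department and per job, the count + case when statement above (or the analytic functions in the next section) is the right tool.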

3. Simultaneous aggregation of different groups and partitions

Now there is a requirement: It is required to list the number of employees in the department and position in the detailed data of the employee table.

Before analytic functions, this kind of requirement had to be written by joining the table back to aggregated copies of itself:

SQL> with t as
  2   (select count(*) as cnt from emp),
  3  t1 as
  4   (select deptno, count(*) as dcnt from emp group by deptno),
  5  t2 as
  6   (select job, count(*) as jcnt from emp group by job)
  7  select emp.ename,
  8         emp.deptno,
  9         t1.dcnt,
 10         emp.job,
 11         t2.jcnt,
 12         (select * from t) as cnt
 13    from emp
 14   inner join t1
 15      on (emp.deptno = t1.deptno)
 16   inner join t2
 17      on (emp.job = t2.job);

ENAME      DEPTNO       DCNT JOB             JCNT        CNT
---------- ------ ---------- --------- ---------- ----------
FORD           20          5 ANALYST            2         14
SCOTT          20          5 ANALYST            2         14
MILLER         10          3 CLERK              4         14
JAMES          30          6 CLERK              4         14
ADAMS          20          5 CLERK              4         14
SMITH          20          5 CLERK              4         14
CLARK          10          3 MANAGER            3         14
BLAKE          30          6 MANAGER            3         14
JONES          20          5 MANAGER            3         14
KING           10          3 PRESIDENT          1         14
TURNER         30          6 SALESMAN           4         14
MARTIN         30          6 SALESMAN           4         14
WARD           30          6 SALESMAN           4         14
ALLEN          30          6 SALESMAN           4         14

14 rows selected


SQL> 

Take a look at the execution plan:

 Plan Hash Value  : 

------------------------------------------------------------------------------
| Id  | Operation               | Name      | Rows | Bytes | Cost | Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |           |   14 |   868 |   12 | 00:00:01 |
|   1 |   VIEW                  |           |    1 |    13 |    1 | 00:00:01 |
|   2 |    SORT AGGREGATE       |           |    1 |       |      |          |
|   3 |     INDEX FULL SCAN     | IDX_EMPNO |   15 |       |    1 | 00:00:01 |
| * 4 |   HASH JOIN             |           |   14 |   868 |   11 | 00:00:01 |
| * 5 |    HASH JOIN            |           |   13 |   559 |    7 | 00:00:01 |
|   6 |     VIEW                |           |    3 |    78 |    4 | 00:00:01 |
|   7 |      SORT GROUP BY      |           |    3 |     9 |    4 | 00:00:01 |
|   8 |       TABLE ACCESS FULL | EMP       |   15 |    45 |    3 | 00:00:01 |
| * 9 |     TABLE ACCESS FULL   | EMP       |   13 |   221 |    3 | 00:00:01 |
|  10 |    VIEW                 |           |    5 |    95 |    4 | 00:00:01 |
|  11 |     SORT GROUP BY       |           |    5 |    40 |    4 | 00:00:01 |
|  12 |      TABLE ACCESS FULL  | EMP       |   15 |   120 |    3 | 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 4 - access("EMP"."JOB"="T2"."JOB")
* 5 - access("EMP"."DEPTNO"="T1"."DEPTNO")
* 9 - filter("EMP"."JOB" IS NOT NULL AND "EMP"."DEPTNO" IS NOT NULL)

This way of writing is more complicated, and it accesses the emp table four times (since I built an index, one of those accesses goes through the index rather than the table itself).
If you use analytic functions instead, the statement is much simpler:

SQL> select emp.ename,
  2         emp.deptno,
  3         count(*) over(partition by deptno) dcnt,
  4         emp.job,
  5         count(*) over(partition by job) jcnt,
  6         count(*) over() as cnt
  7    from emp
  8  ;

ENAME      DEPTNO       DCNT JOB             JCNT        CNT
---------- ------ ---------- --------- ---------- ----------
MILLER         10          3 CLERK              4         14
KING           10          3 PRESIDENT          1         14
CLARK          10          3 MANAGER            3         14
SMITH          20          5 CLERK              4         14
SCOTT          20          5 ANALYST            2         14
ADAMS          20          5 CLERK              4         14
FORD           20          5 ANALYST            2         14
JONES          20          5 MANAGER            3         14
WARD           30          6 SALESMAN           4         14
MARTIN         30          6 SALESMAN           4         14
TURNER         30          6 SALESMAN           4         14
ALLEN          30          6 SALESMAN           4         14
JAMES          30          6 CLERK              4         14
BLAKE          30          6 MANAGER            3         14

14 rows selected

Look at the execution plan:

 Plan Hash Value  : 4086863039 

----------------------------------------------------------------------
| Id | Operation             | Name | Rows | Bytes | Cost | Time     |
----------------------------------------------------------------------
|  0 | SELECT STATEMENT      |      |   15 |   255 |    5 | 00:00:01 |
|  1 |   WINDOW SORT         |      |   15 |   255 |    5 | 00:00:01 |
|  2 |    WINDOW SORT        |      |   15 |   255 |    5 | 00:00:01 |
|  3 |     TABLE ACCESS FULL | EMP  |   15 |   255 |    3 | 00:00:01 |
----------------------------------------------------------------------

From the execution plan's point of view, the table is scanned only once.
But didn't the previous two articles keep telling you to be cautious with analytic functions? Why am I recommending them here?
When you run into a situation like this, where the same table is accessed multiple times, try to see whether the query can be rewritten with analytic functions and how efficient the rewrite is. If, as here, analyzing the execution plan shows an obvious performance improvement, then of course analytic functions suit your scenario. And the most important point of all: don't forget to verify the data after rewriting! This is extremely important.
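As a minimal sketch of that verification step (using the EMP demo table and the two queries above), you can compare the two result sets with MINUS in both directions; both counts should be 0:

-- A sketch of verifying the rewrite: old_way is the self-join version,
-- new_way is the analytic-function version; both MINUS directions should be empty.
with old_way as
 (select emp.ename,
         emp.deptno,
         t1.dcnt,
         emp.job,
         t2.jcnt,
         (select count(*) from emp) as cnt
    from emp
   inner join (select deptno, count(*) as dcnt from emp group by deptno) t1
      on emp.deptno = t1.deptno
   inner join (select job, count(*) as jcnt from emp group by job) t2
      on emp.job = t2.job),
new_way as
 (select ename,
         deptno,
         count(*) over(partition by deptno) as dcnt,
         job,
         count(*) over(partition by job) as jcnt,
         count(*) over() as cnt
    from emp)
select (select count(*)
          from (select * from old_way minus select * from new_way)) as in_old_not_in_new,
       (select count(*)
          from (select * from new_way minus select * from old_way)) as in_new_not_in_old
  from dual;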


Summary

The main content of this article: using row-to-column conversion to show the distribution of personnel across jobs (each job displayed as a column, each employee as a row), the issues to watch out for when chaining row-to-column conversions, and how to aggregate over different groups (partitions) at the same time — listing the employee counts per department and per job alongside the detail rows of the employee table — verified through the execution plan.
