Multidimensional analysis backend practice 3: Dimensional sorting and compression

Practice goals

The goal of this installment is to implement dimensional sorting and compression on top of the data type conversion completed previously, further improving calculation speed.

Practice steps:

1. Prepare the basic wide table: modify the previous installment's code to perform the dimension sorting and compression, and save the result as a new group table.

2. Access the basic wide table: the previous installment's code needs no modification and applies directly to the new group table.

3. Append new data: new business data arrives every day and is reorganized once a month. This should affect the performance of accessing the basic wide table as little as possible, while also reducing the time needed to process the daily new data.

 

The sample wide table in this installment is still the customer table. The SQL statement for retrieving the wide-table data from the Oracle database is select * from customer.


 

Assuming the current date is 2021-01-12, the SQL for fetching that day's new data is:

select * from customer where begin_date=to_date('2021-01-12','yyyy-mm-dd')

 

The goal of the multidimensional analysis calculation is also unchanged, expressed by the following Oracle SQL statement:

select department_id,job_id,to_char(begin_date,'yyyymm') begin_month,sum(balance) sum,count(customer_id) count
from customer
where department_id in (10,20,50,60,70,80)
and job_id in ('AD_VP','FI_MGR','AC_MGR','SA_MAN','SA_REP')
and begin_date>=to_date('2002-01-01','yyyy-mm-dd')
and begin_date<=to_date('2020-12-31','yyyy-mm-dd')
and flag1='1' and flag8='1'
group by department_id,job_id,to_char(begin_date,'yyyymm')

 

Prepare the wide table

Dimensional sorting and compression here means ordered columnar storage. Columnar storage means the data is stored column by column; when esProc creates a new group table, it uses columnar storage by default.

Ordered means that the field values are stored in physical order, that is, the data is sorted by the dimension fields before being written to the group table. The order of the sort fields is critical: dimensions with a high degree of repetition should be placed first.
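As a minimal illustration (hypothetical file and field names), the # prefix in create marks a sort dimension, and each field is stored as a column:

A
1 =file("data/demo.ctx").create@y(#d1,#d2,m1)

Data appended to such a group table must already be ordered by d1 and then d2.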

The dimensions in this example are: department_id, job_num, employee_id, begin_date, and customer_id. Among them, the department number department_id has the fewest distinct values (only 11 departments appear in the fact table), so its degree of repetition is the highest. The repetition of job_num, employee_id, begin_date, and customer_id decreases in that order.

Also, in practical applications these fields tend to appear as grouping fields with roughly the same frequency as their degree of repetition, so the sort order can be set to: department_id, job_num, employee_id, begin_date, customer_id.
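If in doubt, the repetition degrees can be checked by counting distinct values directly in the database; a minimal sketch (the fewer distinct values a field has, the higher its repetition, so it should sort earlier):

A
1 =connect@l("oracle")
2 =A1.query("select count(distinct department_id) dept,count(distinct job_id) job,count(distinct employee_id) emp,count(distinct begin_date) bdate,count(distinct customer_id) cust from customer")
3 >A1.close()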

We let the database do the sorting. For example:

select department_id,job_id,employee_id,begin_date,customer_id,first_name,last_name,phone_number,job_title,balance,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8
from customer
order by department_id,job_id,employee_id,begin_date,customer_id


 

Rewrite etl.dfx according to the above requirements: fetch the sorted data from the database, perform the type conversion, and generate a group table file storing the basic wide table. The code is as follows:

A B
1 =connect@l("oracle")
2 =A1.cursor@d("select department_id,job_id,employee_id,begin_date,customer_id,first_name,last_name,phone_number,job_title,balance,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8 from customer order by department_id,job_id,employee_id,begin_date,customer_id")
3 =A1.query@d("select job_id from jobs order by job_id") =file("data/job.btx").export@z(A3)
4 =A3.(job_id) =date("2000-01-01")
5 =A2.new(int(department_id):department_id,A4.pos@b(job_id):job_num,int(employee_id):employee_id,int(interval@m(B4,begin_date)*100+day(begin_date)):begin_date,int(customer_id):customer_id,first_name,last_name,phone_number,job_title,float(balance):balance,department_name,int(flag1):flag1,int(flag2):flag2,int(flag3):flag3,int(flag4):flag4,int(flag5):flag5,int(flag6):flag6,int(flag7):flag7,int(flag8):flag8)
6 =file("data/customer.ctx").create@y(#department_id,#job_num,#employee_id,#begin_date,#customer_id,first_name,last_name,phone_number,job_title,balance,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8)
7 =A6.append(A5) >A6.close(),A1.close()

Note that the SQL in A2 adds the order by clause, and the create@y in A6 declares the sort fields with the # prefix.

The rest of the code is the same as in the previous installment.

 

With a data volume of 100 million rows, the exported group table file compares with those of the previous installments as follows:

Installment    File size    Description
First          3.5GB        Exported directly from the database, no optimization
Second         3.0GB        Data type optimization completed
Third          2.4GB        Data type optimization plus dimensional sorting and compression

As the table shows, data type optimization reduced the file size by 14% (0.5GB), and dimensional sorting and compression reduced it by a further 20% (0.6GB), for an overall reduction of 31% (1.1GB). A smaller file means less data is read from disk, which effectively improves performance.

Access the wide table

The SPL and Java code for accessing the wide table is unchanged from the previous installment.
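It is not repeated here, but the sketch below shows roughly what the backend query looks like against the new group table (assumptions: the filter values are written inline instead of being passed in as parameters; 2401 and 25131 are 2002-01-01 and 2020-12-31 under the month-offset date encoding from etl.dfx; begin_date\100 recovers the month ordinal relative to 2000-01, which the caller can convert back to yyyymm; the exact cell layout of the real script may differ):

A
1 =file("data/job.btx").import@b().(job_id)
2 =["AD_VP","FI_MGR","AC_MGR","SA_MAN","SA_REP"].(A1.pos@b(~))
3 =file("data/customer.ctx").open().cursor(department_id,job_num,begin_date,balance,customer_id;[10,20,50,60,70,80].contain(department_id) && A2.contain(job_num) && begin_date>=2401 && begin_date<=25131 && flag1==1 && flag8==1)
4 =A3.groups(department_id,job_num,begin_date\100:begin_month;sum(balance):sum,count(customer_id):count)

Because the group table is now physically ordered by department_id first, filtering on the leading sort fields works with the smaller file to produce the speedup measured below.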

The total execution time, measured from the Java call through the backend calculation returning its result, compares with the previous installments as follows:

Installment    Single thread    Two threads in parallel
First          120 seconds      75 seconds
Second         59 seconds       36 seconds
Third          21 seconds       15 seconds

The comparison shows that dimensional sorting and compression further improves calculation performance.

Append new data

The customer table gains new data every day, which must be added to the group table file regularly. If the group table file were ordered by date, each day's new data could simply be appended to the end of the file. But our customer group table is ordered by department and the other dimension fields, so appending directly at the end would break the overall order, while re-sorting all of the old and new data every day would take too long.

Instead, we read the new data from the database, sort it by department and the other dimension fields, and append it with the T.append@a() function. esProc automatically creates a supplementary file, into which each day's new data is merged in order. The supplementary file stays relatively small, so the ordered merge takes little time. Only once a month do the group table file and the supplementary file need to be reorganized, merging the supplementary data into the group table file in order.

 

Using this method, write etlAppend.dfx. The script takes a single grid parameter, today, which holds the date of the newly added data.

The SPL code is as follows:

A B
1 if day(today)==1 =file("data/customer.ctx").reset()
2 =connect@l("oracle")
3 =A2.cursor@d("select department_id,job_id,employee_id,begin_date,customer_id,first_name,last_name,phone_number,job_title,balance,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8 from customer where begin_date=? order by department_id,job_id,employee_id,begin_date,customer_id",today)
4 =A2.query@d("select job_id from jobs order by job_id")
5 =A4.(job_id) =date("2000-01-01")
6 =A3.new(int(department_id):department_id,A5.pos@b(job_id):job_num,int(employee_id):employee_id,int(interval@m(B5,begin_date)*100+day(begin_date)):begin_date,int(customer_id):customer_id,first_name,last_name,phone_number,job_title,float(balance):balance,department_name,int(flag1):flag1,int(flag2):flag2,int(flag3):flag3,int(flag4):flag4,int(flag5):flag5,int(flag6):flag6,int(flag7):flag7,int(flag8):flag8)
7 =file("data/customer.ctx").open().append@a(A6)
8 >A7.close(),A2.close()

A1: Determine whether the input date is the first day of the month. If it is, B1 reorganizes the customer group table, merging the supplementary file accumulated from the new data into the group table file in order.

A2: Connect to the Oracle database.

A3: Fetch the day's new data, sorted by the dimension fields.

A4: Fetch the jobs table data for the type conversion.

A5, B5, A6: The same type conversion as in etl.dfx above.

A7: Merge the day's new data into the supplementary file in order.

A8: Close the group table file and the database connection.

 

etlAppend.dfx needs to be executed once a day. It can be scheduled from an ETL tool or an operating system timed task, calling the esProc script from the command line.

For example:

C:\Program Files\raqsoft\esProc\bin>esprocx d:\olap\etlAppend.dfx
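On Windows, for instance, a daily scheduled task could be created along these lines (a sketch: the installation path is taken from the example above, the task name is arbitrary, and if the today parameter has no default value the date would need to be appended after the script name):

schtasks /create /tn etlAppend /sc daily /st 01:00 /tr "\"C:\Program Files\raqsoft\esProc\bin\esprocx.exe\" d:\olap\etlAppend.dfx"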

