Multi-dimensional analysis background practice 2: data type optimization

Practice goals

The goal of this issue is to practice converting data read from the database into data types that favor performance optimization, such as small integers and floating-point numbers.

Steps to practice:

1. Prepare the basic wide table: modify the code from the previous issue to complete the data type optimization and save the result as a group table file.

2. Access the basic wide table: modify the code from the previous issue so that, with the input parameters unchanged, it queries the converted group table file while the result set still returns the original display values. SQL cannot perform this conversion of incoming parameters and result sets, so the wide-table access code uses SPL as the example.

 

The sample wide table in this issue is unchanged: it is still the customer table. The SQL statement to retrieve the wide table data from the Oracle database is select * from customer. The execution result is as follows:

 ..

The fields include:

CUSTOMER_ID NUMBER(10,0), customer number

FIRST_NAME VARCHAR2(20), first name

LAST_NAME VARCHAR2(25), last name

PHONE_NUMBER VARCHAR2(20), phone number

BEGIN_DATE DATE, account opening date

JOB_ID VARCHAR2(10), occupation number

JOB_TITLE VARCHAR2(32), occupation name

BALANCE NUMBER(8,2), balance

EMPLOYEE_ID NUMBER(4,0), account opening employee number

DEPARTMENT_ID NUMBER(4,0), branch number

DEPARTMENT_NAME VARCHAR2(32), branch name

FLAG1 CHAR(1), mark 1

FLAG2 CHAR(1), mark 2

FLAG3 CHAR(1), mark 3

FLAG4 CHAR(1), mark 4

FLAG5 CHAR(1), mark 5

FLAG6 CHAR(1), mark 6

FLAG7 CHAR(1), mark 7

FLAG8 CHAR(1), mark 8

 

The goal of multi-dimensional analysis and calculation is also unchanged, expressed by the following Oracle SQL statement:

select department_id, job_id, to_char(begin_date,'yyyymm') begin_month, sum(balance) sum, count(customer_id) count

from customer

where department_id in (10,20,50,60,70,80)

and job_id in ('AD_VP','FI_MGR','AC_MGR','SA_MAN','SA_REP')

and begin_date>=to_date('2002-01-01','yyyy-mm-dd')

and begin_date<=to_date('2020-12-31','yyyy-mm-dd')

and flag1='1' and flag8='1'

group by department_id,job_id,to_char(begin_date,'yyyymm')

Prepare wide table

1. Integerization of numeric values

Some fields in the customer table are integers, such as CUSTOMER_ID, EMPLOYEE_ID, DEPARTMENT_ID.

Approach:

- If the value exported from the database is already an integer type, it can be stored in the group table directly.

- If the exported value is not an integer type, use a type conversion function to cast it to an integer.

- Keep integer values below 65536 where possible, which gives the best performance. If the original values were artificially turned into larger integers, for example by adding 100000 so that they become 100001, 100002, ..., the leading 1 should be removed.
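The padding-removal rule can be sketched in Python (illustrative only, not SPL; the function name and the 100000 padding base are hypothetical values for this example):

```python
# Illustrative sketch (Python, not SPL): keep coded integer values below
# 65536 for the best performance, and strip an artificial leading 1 that
# was added to the original codes (100001, 100002, ... -> 1, 2, ...).

SMALL_INT_LIMIT = 65536
PAD_BASE = 100000  # hypothetical padding base used in the example values

def to_small_int(value):
    """Cast to int and remove the artificial padding if present."""
    v = int(value)
    if v >= PAD_BASE:
        v -= PAD_BASE  # drop the leading 1
    if v >= SMALL_INT_LIMIT:
        raise ValueError("value exceeds the small-integer range")
    return v

print(to_small_int("100001"))  # -> 1
print(to_small_int(42))        # -> 42
```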

2. Integerization of strings

FLAG1 to FLAG8 are strings, but the stored data are actually integers, so they can be converted to integers with a type conversion function.

The JOB_ID field is also a string; its value is the primary key of the jobs dimension table, an enumeration. We can replace JOB_ID with its serial number in the jobs table to achieve integerization.

The jobs table structure and sample data are as follows:

..

Approach:

- Take the JOB_ID values from jobs, sort them to form a sequence job, and add a JOB_NUM field to the customer wide table to store the serial number of each JOB_ID in the sequence job.
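This mapping can be sketched in Python (illustrative only, not SPL; the sample job keys are taken from the query parameters later in this article, and SPL sequence positions are 1-based):

```python
# Illustrative sketch (Python, not SPL): integerize the enumeration key
# JOB_ID by replacing it with its 1-based position in the sorted jobs
# sequence, mirroring what A3.pos@b(job_id) does in the SPL code.

jobs = sorted(["SA_REP", "AD_VP", "FI_MGR", "AC_MGR", "SA_MAN"])  # sample keys

# 1-based positions, like SPL sequence positions
job_pos = {job_id: i + 1 for i, job_id in enumerate(jobs)}

def job_num(job_id):
    """Return the serial number stored as JOB_NUM in the wide table."""
    return job_pos[job_id]

print(job_num("AD_VP"))  # -> 2 (sorted order: AC_MGR, AD_VP, FI_MGR, ...)
```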

3. Integerization of dates

In most cases date data is only compared and no interval needs to be computed, so dates can also be stored as small integers. Multi-dimensional analysis commonly calculates by year and month, so the integerized date should make it easy to split out the year and month.

Approach:

- Compute the number of whole months between the BEGIN_DATE value and a chosen start date, multiply it by 100, and add the day value of BEGIN_DATE; this integer replaces the date type in the group table. The start date is chosen according to the characteristics of the date data: the later, the better.

For example, if all BEGIN_DATE values turn out to be after 2000, we can set the start date to 2000-01-01.

After determining the start date, convert the BEGIN_DATE values in the customer wide table. For example, for BEGIN_DATE 2010-11-20, the number of whole months from 2000-01-01 is 130; multiplying by 100 and adding the day value 20 gives the small integer 13020.

With 2000-01-01 as the start date, the integerized value stays below 65536 for any BEGIN_DATE before 2050. So, if the business data allows, the start date should be as late as possible, which keeps the dates in the wide table from exceeding the small-integer range.
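The encoding and its inverse can be sketched in Python (illustrative only, not SPL; the function names are hypothetical):

```python
# Illustrative sketch (Python, not SPL) of the date integerization:
# months since the start date 2000-01-01, times 100, plus the day value.
from datetime import date

START = date(2000, 1, 1)

def encode_date(d):
    """Whole months since START, * 100, + day of month."""
    months = (d.year - START.year) * 12 + (d.month - START.month)
    return months * 100 + d.day

def decode_date(n):
    """Inverse conversion: small integer back to a real date."""
    months, day = divmod(n, 100)
    total = (START.month - 1) + months
    return date(START.year + total // 12, total % 12 + 1, day)

print(encode_date(date(2010, 11, 20)))  # -> 13020, as in the text
print(decode_date(13020))               # -> 2010-11-20
```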

4. Cases that cannot be integerized

Fields that must be represented by strings, such as FIRST_NAME, JOB_TITLE, etc.;

Fields that must be represented by floating-point numbers, such as amounts, discount rates, and other fields with a fractional part;

Fields that must be represented by a character string plus an integer, such as an international telephone number.

Approach:

- Keep the original field values unchanged.

 

According to the above requirements, rewrite etl.dfx: read the data from the database, convert the types, and generate the group table file that stores the basic wide table. A code example is as follows:

	A	B
1	=connect@l("oracle")	=A1.cursor@d("select * from customer")
2	=A1.query@d("select job_id from jobs order by job_id")	=file("data/job.btx").export@z(A2)
3	=A2.(job_id)	=date("2000-01-01")
4	=B1.new(int(customer_id):customer_id,first_name,last_name,phone_number,int(interval@m(B3,begin_date)*100+day(begin_date)):begin_date,A3.pos@b(job_id):job_num,job_title,float(balance):balance,int(employee_id):employee_id,int(department_id):department_id,department_name,int(flag1):flag1,int(flag2):flag2,int(flag3):flag3,int(flag4):flag4,int(flag5):flag5,int(flag6):flag6,int(flag7):flag7,int(flag8):flag8)
5	=file("data/customer.ctx").create@y(customer_id,first_name,last_name,phone_number,begin_date,job_num,job_title,balance,employee_id,department_id,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8)
6	=A5.append(A4)	>A5.close(),A1.close()

A1: Connect to the pre-configured database oracle. @l means field names are retrieved in lower case; note that this is a lowercase letter L.

B1: Create a database cursor to fetch the data of the customer table. customer is the fact table and is generally large in real applications, so a cursor is used to avoid memory overflow. The @d option converts Oracle numeric data to double instead of decimal, because decimal data performs poorly in Java.

A2: Read the jobs table from the database; only the JOB_ID field is read, sorted. jobs is a dimension table and is generally small, so it is read directly into memory.

B2: Store the data of A2 as a set file for later use.

A3: Convert A2 into sequence.

B3: Define the start date 2000-01-01.

A4: Use the new function to define three calculations.

1. CUSTOMER_ID and similar fields are known to be integer values, so they are converted from double or string to int by applying the int function directly. Note that an int cannot exceed 2147483647; for fact tables whose row count exceeds this value, the serial-number primary key should use the long type.

2. Convert JOB_ID from string to integer to improve calculation performance. The method is to use the pos function to find the serial number of job_id in A3, which is defined as the JOB_NUM field.

3. Use interval to calculate the number of whole months between begin_date and 2000-01-01, multiply it by 100 and add the day value of begin_date, then convert it to an integer with int and store it as the new begin_date.

A5: Create the group table file. The field names are exactly the same as in A4.

A6: While calculating cursor A4, output to the group table file.

B6: Close the group table file and database connection.

 

The data volume is 10 million rows, and the exported group table file is about 344MB. The comparison with the phase-one file, which had no data type optimization, is as follows:

Phase	File size	Description
Phase 1	393MB	Exported directly from the database, no optimization
Phase 2	344MB	With data type optimization

 

As the table shows, after data type optimization the file is 49MB (about 12%) smaller. A smaller file reduces the amount of data read from disk and effectively improves performance.

Access wide table

As mentioned above, many fields of the background group table have been optimized and converted, so the original SQL can no longer query it. Instead, we execute a script that receives parameters such as filter conditions and grouping fields; the background converts the parameter values into the optimized data types and then computes over the group table. This keeps the parameters passed in by a general multi-dimensional analysis front end unchanged. Finally, the calculation result also has to be converted back to the corresponding display values.

For example, the passed-in condition flag1='1' needs to become flag1=1, and the job_num and begin_date in the calculation result must be converted from integers back to the string job_id and to a date.
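Both conversion directions can be sketched in Python (illustrative only, not SPL; the function names and the sample jobs sequence are hypothetical):

```python
# Illustrative sketch (Python, not SPL) of the two conversion directions:
# parameters go from display values to the optimized integer form, and
# result fields go from integers back to display values.

jobs = sorted(["SA_REP", "AD_VP", "FI_MGR", "AC_MGR", "SA_MAN"])

def flag_param_to_int(cond):
    # flag1=="1" && flag8=="1"  ->  flag1==1 && flag8==1
    return cond.replace('"', "")

def job_num_to_id(num):
    # 1-based serial number back to the original string key
    return jobs[num - 1]

def months_to_yearmonth(months, start_year=2000):
    # month count since 2000-01 back to a yyyymm display value
    return "%d%02d" % (start_year + months // 12, months % 12 + 1)

print(flag_param_to_int('flag1=="1"&& flag8=="1"'))  # -> flag1==1&& flag8==1
print(job_num_to_id(2))                              # -> AD_VP
print(months_to_yearmonth(130))                      # -> 201011
```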

To implement this calculation, an init.dfx file must be written in the main directory of the node server to pre-load the global variable job used by the subsequent conversions.

The code of init.dfx is as follows:

	A	B
1	=file("data/job.btx").import@ib()	=env(job,A1)

A1: Read the data from the set file; @i means that a single-column result is read as a sequence.

B1: Store in the global variable job.

The finished init.dfx is placed in the main directory of the node machine and is called automatically when the node is started or restarted.

 

Rewrite olap-spl.dfx according to the data type optimization requirements, use SPL code to access the wide table and perform filtering and grouping summary calculations.

Define grid parameters, and pass in file name, department number, job number, flag bit, date range, grouping field, and aggregation expression respectively.

The parameter setting window is as follows, exactly the same as in phase one:

..

Sample parameter values:

arg_filename="data/customer.ctx"

arg_department_id ="10,20,50,60,70,80"

arg_job_id="AD_VP,FI_MGR,AC_MGR,SA_MAN,SA_REP"      

arg_begin_date_min = "2002-01-01"

arg_begin_date_max ="2020-12-31"

arg_flag ="flag1==\"1\"&& flag8==\"1\" "

arg_group="department_id,job_id,begin_yearmonth"

arg_aggregate="sum(balance):sum,count(customer_id):count"

Note: if the grouping field is begin_date, grouping is by date; if it is begin_yearmonth, grouping is by year and month. To the multi-dimensional analysis front end these can be regarded as two separate fields.

 

The SPL code example is as follows:

	A	B
1	=file(arg_filename).open()	=date("2000-01-01")
2	=int(interval@m(B1,arg_begin_date_max)*100+day(arg_begin_date_max))	=int(interval@m(B1,arg_begin_date_min)*100+day(arg_begin_date_min))
3	=arg_job_id.split@c().(job.pos@b(~))	=arg_department_id.split@c().(int(~))
4	=replace(arg_flag,"\"","")	=replace(arg_group,"job_id","job_num")
5	=replace(B4,"begin_yearmonth","begin_date\\100:begin_yearmonth")
6	=A1.cursor@m(;B3.contain(department_id) && A3.contain(job_num) && begin_date>=B2 && begin_date<=A2 && ${A4};2)
7	=A6.groups(${A5};${arg_aggregate})
8	=A7.fname()	=A8(A8.pos("job_num"))="job(job_num):job_id"
9	=A8(A8.pos("begin_yearmonth"))="month@y(elapse@m(B1,begin_yearmonth)):begin_yearmonth"
10	=A8(A8.pos("begin_date"))="elapse@m(B1,begin_date\\100)+begin_date%100-1:begin_date"
11	=A7.new(${A8.concat@c()})
12	return A11

A1: Open the group table object.

B1: Define the start date 2000-01-01 for converting date values in the parameters and results.

A2, B2: Convert the incoming date parameter into an integer according to the method described above.

A3: Convert the incoming comma-separated job_id string into positions in the global variable job sequence, producing an integer job_num sequence.

B3: Use the int function to convert the incoming comma-separated string department_id into an integer sequence.

A4: Remove the double quotation marks from the passed-in flag condition, turning it into an integer condition.

B4: Replace job_id in the incoming group field with job_num.

A5: Replace begin_yearmonth in the incoming grouping fields with begin_date\100. The begin_date value divided by 100 and rounded down is the number of months between the actual date and the start date.
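The split performed by begin_date\100 can be sketched in Python (illustrative only, not SPL; the function name is hypothetical):

```python
# Illustrative sketch (Python, not SPL): split the integerized begin_date
# by integer division, as begin_date\100 does in A5. The quotient is the
# month count (the begin_yearmonth grouping key), the remainder the day.

def split_encoded_date(n):
    months, day = divmod(n, 100)
    return months, day

print(split_encoded_date(13020))  # -> (130, 20): month count 130, day 20
```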

A6: Define a cursor with filtering conditions.

A7: Group and aggregate the small result set computed over the cursor.

A8: Form the field names of the A7 result into a sequence.

B8: If job_num is among the field names, replace it with a conversion expression that converts job_num in the grouping result back to job_id.

A9: If begin_yearmonth is among the field names, replace it with a conversion expression that converts the month-difference integer begin_yearmonth into yyyymm form.

A10: If begin_date is among the field names, replace it with a conversion expression that converts the integerized date value in the grouping result back to a date type.

A11: Join the modified A8 with commas into a string and evaluate it over A7, completing the conversion of field types and display values.

A12: Return the A11 result set.

 

The execution result is as follows:

..

 

After olap-spl.dfx is written, it can be called as a stored procedure from the multi-dimensional analysis application. The Java code is the same as in phase one, as follows:

public void testOlapServer(){
	Connection con = null;
	java.sql.PreparedStatement st;
	try{
		// establish the connection
		Class.forName("com.esproc.jdbc.InternalDriver");
		// get the connection from the url
		con = DriverManager.getConnection("jdbc:esproc:local://?onlyServer=true&sqlfirst=plus");
		// call the stored procedure; olap-spl is the dfx file name
		st = con.prepareCall("call olap-spl(?,?,?,?,?,?,?,?)");
		st.setObject(1, "data/customer.ctx");                          // arg_filename
		st.setObject(2, "10,20,50,60,70,80");                          // arg_department_id
		st.setObject(3, "AD_VP,FI_MGR,AC_MGR,SA_MAN,SA_REP");          // arg_job_id
		st.setObject(4, "2002-01-01");                                 // arg_begin_date_min
		st.setObject(5, "2020-12-31");                                 // arg_begin_date_max
		st.setObject(6, "flag1==\"1\"&& flag8==\"1\" ");               // arg_flag
		st.setObject(7, "department_id,job_id,begin_yearmonth");       // arg_group
		st.setObject(8, "sum(balance):sum,count(customer_id):count");  // arg_aggregate
		// execute the stored procedure
		st.execute();
		// get the result set
		ResultSet rs = st.getResultSet();
		// continue to process and display the result set here
	}
	catch(Exception e){
		out.println(e);
	}
	finally{
		// close the connection
		if (con != null) {
			try { con.close(); }
			catch(Exception e) { out.println(e); }
		}
	}
}

 

The total execution time, Java call plus background calculation and return of the result, compared with phase one:

Phase	Single thread	Two threads in parallel
Phase 1	120 seconds	75 seconds
Phase 2	59 seconds	36 seconds

As mentioned in the previous issue, the execution times in the table depend on the hardware configuration, so their absolute values are not important. What matters is that the comparison shows data type optimization effectively improves computing performance.


Origin: blog.51cto.com/12749034/2602006