Data Warehousing in Practice (10): Zipper Table Processing


Modern business systems keep growing, and in large financial institutions and e-commerce platforms, the account and order tables are huge. To preserve the history of changes in the data warehouse, the changed data must be loaded into the warehouse every day. Compared with the total data volume, though, the daily changes are a small minority. For example, a bank with tens of millions of accounts typically sees only hundreds of thousands of transactions per day, which means only a few hundred thousand account records change. An e-commerce orders table may hold tens of millions of rows, but the new orders plus changes to earlier orders in a single day may be under a million. In this situation, zipper tables are the most appropriate way to store the increments.

Most source systems are plain transaction systems and do no incremental pre-processing. So the scenario we usually face is this: every day the source system delivers a snapshot of its current state (all orders, or all currently valid ones), of which a large portion is unchanged, and a small portion is changed or new. We therefore have two things to do:

  1. Identify the incremental (new and changed) data;
  2. Append the incremental data to the history table.

Before looking at how the history is stored, first some details of the zipper-table representation:

  • First, the end date of currently effective records can be set to 9999-12-31, roughly the end of time; to query the currently effective records, you can simply use end date = 9999-12-31 as the condition, which reduces the comparison workload;
  • In general, the full table is partitioned by day; after comparing with the day's snapshot, all data in the partitions where changes occurred is pulled out (that is, if any record in day A's partition changed, the entire day-A partition is extracted), merged with the new data into a temporary table, and then the affected partitions are deleted and re-inserted: "delete + insert" instead of "update".
  • The full table can reflect the state of the business at any point in time: given any date, you can reconstruct the business as of that date, precisely because every record version carries a start date and an end date.
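These conventions can be sketched in a few lines of Python. This is a minimal illustration with made-up field names and dates, not the tables from the article:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # end date of currently effective versions

# Hypothetical zipper rows: (order_no, status, start_date, end_date)
rows = [
    ("001", "ordered", date(2020, 1, 1), date(2020, 1, 1)),
    ("001", "paid",    date(2020, 1, 2), date(2020, 1, 2)),
    ("001", "shipped", date(2020, 1, 3), OPEN_END),
]

def as_of(rows, day):
    """State of the business on a given date: a version is visible
    when start_date <= day <= end_date."""
    return [r for r in rows if r[2] <= day <= r[3]]

def current(rows):
    """Currently effective versions: end_date equals 9999-12-31."""
    return [r for r in rows if r[3] == OPEN_END]
```

`current(rows)` returns only the "shipped" version, while `as_of(rows, date(2020, 1, 2))` restores the "paid" state of January 2, which is exactly the point-in-time property the last bullet describes.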

In the common case, data is processed once a day: day T's data enters the warehouse on day T+1, so what the warehouse processes on the 2nd is the data of the 1st.
Take a typical orders table as an example:

[Figure: orders table example]

This is a simplified orders table. It has four statuses: ordered, paid, shipped, received. When an order is created, a record is formed in one shot with the order number, the order content, and a status of "ordered"; the other fields are blank. Every time the status changes, the "status" field is modified and the corresponding time field (payment time, shipping time, or receipt time) is filled in, i.e. changed from empty to the time the action occurred. For simplicity, we assume an order's status changes at most once a day.

To store the full history, following the zipper-table definition, we change the structure as follows:

[Figure: warehouse orders table]

Four fields are added: a serial number, a start date, an end date, and an operation status (insert / update / delete).
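A sketch of the resulting row structure (the field names here are my own labels for illustration, not taken from the source tables):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class WarehouseOrderRow:
    # Business fields carried over from the source orders table.
    order_no: str
    content: str
    status: str                    # ordered / paid / shipped / received
    pay_time: Optional[date]
    ship_time: Optional[date]
    receive_time: Optional[date]
    # Fields added for the zipper representation.
    serial_no: int                 # surrogate key for this version
    start_date: date               # date this version took effect
    end_date: date                 # 9999-12-31 while still effective
    operation: str                 # insert / update / delete
```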

Assume one new order arrives per day, and each order's status changes once a day.

The current full-table data is as follows (table A):

[Table A]

Business for January 3 closes, and in the early hours of January 4 the operational system typically starts its data export and transmits the data to the warehouse. So on January 4 we receive the data of January 3 (T-1) (table B):

[Table B]

The current system date is the 4th, while the business date being processed is the 3rd. That is, records that are new or changed in this run take effect starting on the 3rd, while the historical versions they replace get their end date set to the 2nd.

The data of the 3rd can be divided into three parts:

  1. Orders newly placed on that date;
  2. Earlier orders whose status has changed;
  3. Orders with no change.

Using the metadata, specifically the "order placed" timestamp field: records whose value is T-1 (placed on the 3rd, where T is the system date) can be identified directly as new, so this part of the data can be extracted and inserted into the warehouse table as-is.
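This first split can be sketched as follows (a toy snapshot with illustrative field names, assuming the system date is January 4):

```python
from datetime import date, timedelta

def split_new(snapshot, system_date):
    """Split the daily snapshot on the 'order placed' timestamp:
    rows placed on T-1 are new and go straight into the warehouse;
    the rest must still be compared against history."""
    business_date = system_date - timedelta(days=1)
    new = [r for r in snapshot if r["order_time"] == business_date]
    rest = [r for r in snapshot if r["order_time"] != business_date]
    return new, rest

snapshot = [
    {"order_no": "001", "order_time": date(2020, 1, 1), "status": "paid"},
    {"order_no": "003", "order_time": date(2020, 1, 3), "status": "ordered"},
]
new, rest = split_new(snapshot, system_date=date(2020, 1, 4))
# new  -> order 003 (placed on the 3rd, loaded directly)
# rest -> order 001 (must be compared with the history table)
```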

For the remaining data, we need to find the records that changed. Because the source only delivers a snapshot of the current state, the snapshot alone cannot tell us which records changed; it must be compared against the history table. In a relational database this is not complicated: select the full-table records whose end date is 9999-12-31, join them with the snapshot on order number, and find every record where any field differs.
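The same join-and-compare can be sketched outside the database; here a dict lookup plays the role of the join on order number, and the field names are illustrative:

```python
FIELDS = ("content", "status")  # business fields to compare

def find_changed(snapshot_rows, current_rows):
    """Join the snapshot with the currently effective history rows
    on order_no and keep every row where some business field differs."""
    history = {r["order_no"]: r for r in current_rows}
    changed = []
    for row in snapshot_rows:
        old = history.get(row["order_no"])
        if old is not None and any(row[f] != old[f] for f in FIELDS):
            changed.append(row)
    return changed

history = [{"order_no": "001", "content": "book", "status": "ordered"},
           {"order_no": "002", "content": "pen",  "status": "paid"}]
snapshot = [{"order_no": "001", "content": "book", "status": "paid"},   # changed
            {"order_no": "002", "content": "pen",  "status": "paid"}]   # unchanged
# find_changed(snapshot, history) keeps only order 001
```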

But anyone familiar with databases knows this will be very, very slow when the data volume is large. You can borrow the idea of distributed processing: store the data in partitions and process the partitions in parallel. That will be introduced later.

The focus here is how to compare for "sameness". Business tables generally have many fields, and comparing them one by one is time-consuming. A faster approach is to use a hash algorithm; the common MD5 is a fine choice (referred to below simply as Hash). A hash algorithm takes input of arbitrary length and converts it into a fixed-length output, the hash value. But remember that this conversion is a compressing map: the space of hash values is typically much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Records with different hash values are certainly different; records with the same hash value must still have their actual contents compared.

The specific approach is to add a field to the full table holding the hash computed over the record's business fields concatenated in a fixed order (the field order must be fixed, for example sorted by field name). The daily snapshot table gets the same field, computed by concatenating in the same order. Then, for each order number, compare the snapshot record's hash with that of the full-table record whose end date is 9999-12-31: if they differ, the record has changed; if they are equal, compare the specific fields to confirm the records really are the same.
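A minimal sketch of this hashing scheme, using Python's standard `hashlib`. One small addition of my own: the fields are joined with a "|" separator rather than bare concatenation, so that values like "ab"+"c" and "a"+"bc" cannot produce the same string:

```python
import hashlib

FIELDS = ("order_no", "content", "status", "pay_time")  # one agreed, fixed order

def row_hash(row):
    """MD5 over the business fields joined in a fixed order;
    empty fields become '' so all rows hash over the same layout."""
    joined = "|".join("" if row.get(f) is None else str(row[f]) for f in FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

old = {"order_no": "001", "content": "book", "status": "ordered", "pay_time": None}
new = dict(old, status="paid", pay_time="2020-01-02")

if row_hash(new) != row_hash(old):
    outcome = "changed"                   # different hashes: definitely different
elif any(new[f] != old[f] for f in FIELDS):
    outcome = "changed (hash collision)"  # same hash but fields differ: rare case
else:
    outcome = "unchanged"
```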

The more the data changes, the more obvious the advantage of this approach.

After the comparison, the resulting data falls into three groups:

  • Changed order data that needs to be inserted into the full table;
  • Existing full-table records whose end date must be modified because the data changed;
  • Unchanged data, i.e. orders with no change that day.

The result is shown below (the new orders of the 3rd are not included here):

[Figure: changed data]

Remember that the full table cannot be updated in place. So the result is first formed into a temporary table, then the affected partition (here, the partition of the 1st) is deleted from the full table, and the temporary table is inserted in its place. The Hash field can of course be added as well. Once every partition has been processed, the final full table is formed.
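The "delete + insert" merge above can be sketched end to end. This is a toy model, not a real warehouse engine: partitions are simulated as a dict keyed by start date, and field names are illustrative:

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)

def merge_changes(full_table, changed_rows, business_date):
    """'Delete + insert' merge: close the open version of each changed
    order (end_date := business_date - 1 day), rebuild the affected
    partitions from a temporary list, then insert the new versions
    into the business-date partition."""
    changed_nos = {r["order_no"] for r in changed_rows}
    affected = [day for day, rows in full_table.items()
                if any(r["order_no"] in changed_nos and r["end_date"] == OPEN_END
                       for r in rows)]
    temp = []  # temporary table holding the rebuilt partitions
    for day in affected:
        for r in full_table.pop(day):               # delete the whole partition
            if r["order_no"] in changed_nos and r["end_date"] == OPEN_END:
                r = dict(r, end_date=business_date - timedelta(days=1))
            temp.append(r)
    for r in temp:                                   # ... and re-insert it
        full_table.setdefault(r["start_date"], []).append(r)
    full_table.setdefault(business_date, []).extend(
        dict(r, start_date=business_date, end_date=OPEN_END)
        for r in changed_rows)
    return full_table

full = {date(2020, 1, 1): [{"order_no": "001", "status": "ordered",
                            "start_date": date(2020, 1, 1),
                            "end_date": OPEN_END}]}
changed = [{"order_no": "001", "status": "paid"}]
merge_changes(full, changed, business_date=date(2020, 1, 3))
```

After the merge, the old "ordered" version in the January 1 partition is closed with an end date of January 2, and a new open "paid" version lives in the January 3 partition.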

[Figure: full table after the merge]

One more note: the Hash approach pays off most in partitions with many changes; for partitions with no changes it brings no benefit, since you only know nothing changed after comparing every record. In the orders-table scenario, changes are concentrated in the most recent dates, say within the last month; the probability that much older orders are modified is very low. So for data shaped like this orders table, it is best to determine the longest business period during which an order can still change (for example, the return window), after which it can no longer be modified; any later after-sales service activity should be recorded in a separate table. Otherwise the incremental processing will be very time-consuming.

To be continued.

Previous: Data Warehousing in Practice (9): Incremental vs. Full Load

Next: Data Warehousing in Practice (11): Distributed Incremental Processing



Origin blog.csdn.net/cfy_fantasyxx/article/details/104057386