Data Warehousing in Practice (9) - Incremental vs. full load


Two important principles of data warehousing are:

  • Data is immutable once it enters the warehouse;
  • Changes to data are recorded as history.

How should we understand this? Immutable means that once data enters the warehouse, it is effectively archived. In principle, data in the warehouse must not be modified; if warehouse data could be modified at will, the warehouse would be no different from a transactional system and could not accurately reflect business processes. In addition, the storage products used for data warehousing, such as the early data warehouse products from Oracle and DB2 as well as the components of the Hadoop ecosystem, are designed specifically for "bulk insert, with no or few changes", so that queries can be optimized for maximum efficiency. This is also why there are two models of system: OLTP and OLAP.

Therefore, "data is immutable" is a standard principle.

But the data in business systems (OLTP) changes constantly: new records are added and existing records are modified. Given that warehouse data is immutable, the crude approach is to save a daily snapshot of the business system's data into the warehouse, which in fact preserves the most complete history of daily business changes. But doing so produces a very large volume of data, with large amounts of redundant data continuously entering the warehouse; after all, compared with the historical data, the part that changes each day is usually small.

At this point we have to analyze how the data changes. In general, we classify data by how frequently it changes. There are several possibilities:

  • Pure insert: for example, a bank's transaction journal. Each record describes a transaction; transactions occur all the time, and new transaction records are constantly inserted. Once a transaction has occurred, its content does not change. If a transaction is wrong and must be reversed, the reversal is a cancellation or an offsetting transaction, which amounts to adding another transaction record.
  • Frequently changing: for example, the account table in a banking system. Because accounting transactions occur frequently, account balances keep changing. The balance is an attribute of the account, which means the account table is constantly modified. For active customers, the account record changes practically every day.
  • Slowly changing: for example, a user's address or phone number, which changes, but not often. For an individual customer such a change may be infrequent, but across the full customer information table you should assume that a small part of the records changes every day. The problem this raises is the validity period of information: for example, a customer's address is A during one period and B during the next.

In addition, there is data that changes even less often, such as bank branches, products, and similar business parameters. First, to be clear, this data does change, and the impact of such changes is large, especially on statistical dimensions. Second, the changes are not frequent. Moreover, the volume of this data is generally very small compared with business data.

Another example of frequently changing data is the order table of an e-commerce business, which is somewhat similar to the account balance described above. After an order is placed, the order record is modified as the order status changes, through states such as created, paid, shipped, and received. If the warehouse collects data once a day, status changes that happen within a single day are lost, and orders that take more than one day to complete result in modifications to order data that has already entered the warehouse. If order status changes must be recorded in the warehouse, there are two options: collect the data in real time, or use an order status change log table. One characteristic of the order table is that records usually stop changing after a certain period; for example, data more than a month old (past the receipt and return period) basically no longer changes.

As an aside, transactional systems with very large transaction volumes, such as e-commerce platforms, often turn modify operations into insert operations. For example, the order status change log table just mentioned records the status of an order: when the status changes, a new row is inserted and the order table is not modified. However, this design makes queries more complex, so many systems do not do it.
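The insert-instead-of-modify design can be sketched in a few lines. This is a minimal in-memory illustration, not any particular system's schema: the names `order_status_log`, `record_status`, and `current_status` are hypothetical, and `changed_at` is assumed to be a sortable sequence value.

```python
# Append-only status log: rows are inserted, never modified.
order_status_log = []


def record_status(order_id, status, changed_at):
    """Insert a new status row instead of updating the order record."""
    order_status_log.append(
        {"order_id": order_id, "status": status, "changed_at": changed_at}
    )


def current_status(order_id):
    """Find the latest status by scanning the order's history."""
    history = [r for r in order_status_log if r["order_id"] == order_id]
    if not history:
        return None
    return max(history, key=lambda r: r["changed_at"])["status"]


record_status(1001, "created", 1)
record_status(1001, "paid", 2)
record_status(1001, "shipped", 3)
latest = current_status(1001)
```

The full history is preserved, but reading the current status now requires a search over the history rather than a single-row lookup, which is the extra query complexity mentioned above.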

Once we understand how the data changes, we can design a suitable strategy for loading changing business data into the "immutable" data warehouse.

There are several approaches, depending on the kind of data:

Pure increment

This applies to data such as transaction journals, transaction logs, and registers: each record has a clear timestamp when it occurs and does not change afterwards, like the account transaction journal mentioned above, whose records cannot be changed once created. The day's data can be selected directly by timestamp and appended to the full table in the warehouse, day after day.

Usually a separate date field is added to indicate when the data was loaded.
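A minimal sketch of a pure incremental load, assuming each source row carries a `created_at` timestamp. The field names (`created_at`, `load_date`) and the function `extract_daily_increment` are illustrative assumptions, not part of any specific tool.

```python
from datetime import date, datetime


def extract_daily_increment(source_rows, business_day, load_date):
    """Select the rows created on business_day and tag them with load_date."""
    return [
        {**row, "load_date": load_date}
        for row in source_rows
        if row["created_at"].date() == business_day
    ]


transactions = [
    {"txn_id": 1, "amount": 100, "created_at": datetime(2020, 1, 8, 9, 30)},
    # A reversal is just another inserted transaction.
    {"txn_id": 2, "amount": -100, "created_at": datetime(2020, 1, 9, 10, 0)},
    {"txn_id": 3, "amount": 250, "created_at": datetime(2020, 1, 9, 16, 45)},
]

warehouse = []  # the warehouse table is only ever appended to
warehouse.extend(
    extract_daily_increment(transactions, date(2020, 1, 9), date(2020, 1, 10))
)
```

Existing warehouse rows are never touched; each daily run only appends the rows selected for that day.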

Increment by comparison

This applies to master data or status tables such as the account table and the user information table. The transactional system often keeps only the latest state and does not record changes over time. Of course, some systems do keep an operation log that records the changes.

In the former case, we need to compare the latest source data with the data already in the warehouse to identify what has changed.

In the latter case, if the source system identifies its own increments, the data warehouse platform does not need to do the comparison. However, we still need to consider whether deleted data is flagged; direct physical deletion is uncommon, but it must still be considered.

Changes include inserts, updates, and deletes. Delete here means a real physical delete. Data that is only flagged as deleted (a logical delete) must be analyzed case by case: if the business meaning really is deletion, handle it as a delete. But be cautious with this approach. In a good design, this case should be rare. For example, when an account is closed, we should not physically delete the account record but should instead add a business status called "closed", which is then handled as an ordinary data change.

This brings in the handling of "data validity". Suppose record A enters the system on day D1 and is modified on day D2. The two records stored in the warehouse are as follows:

	Record 1 (R1): content of record A, timestamp: D1, status: insert;
	Record 2 (R2): latest content of record A, timestamp: D2, status: update;

R1 is valid from D1 up to but not including D2; R2 is valid from D2 onward.
When data is deleted, you can take a copy of the latest data, stamp it with the current date, set its status to "delete", and insert it into the warehouse table. That is:

	Record 3 (R3): latest content of record A, timestamp: D3, status: delete;

As for the comparison itself, there is nothing tricky: take the latest source data and compare it record by record with the most recent data in the warehouse.
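The comparison can be sketched as follows. This is a simplified in-memory illustration that assumes both sides are available as dictionaries keyed by primary key; the function name `diff_snapshot` and the field names are hypothetical.

```python
def diff_snapshot(previous, current, load_date):
    """Compare two {key: content} mappings and return change records."""
    changes = []
    for key, content in current.items():
        if key not in previous:
            changes.append(
                {"key": key, "content": content, "ts": load_date, "status": "insert"}
            )
        elif previous[key] != content:
            changes.append(
                {"key": key, "content": content, "ts": load_date, "status": "update"}
            )
    for key, content in previous.items():
        if key not in current:
            # Physically deleted at the source: keep a copy of the last
            # known content and mark it as deleted.
            changes.append(
                {"key": key, "content": content, "ts": load_date, "status": "delete"}
            )
    return changes


warehouse_latest = {"A": {"address": "addr-1"}, "B": {"address": "addr-2"}}
source_today = {"A": {"address": "addr-9"}, "C": {"address": "addr-3"}}
changes = diff_snapshot(warehouse_latest, source_today, "D2")
```

Only the resulting change records are appended to the warehouse table; unchanged rows produce nothing.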

If a warehouse record has only a validity start date and no end date, it is inconvenient to use. For example, to find the currently valid record you have to find the one with the greatest start date. There are therefore two optimizations or modifications built on the mechanism described above:

  • Build a snapshot table in advance. After each day's data has been processed, generate a snapshot table for the table in question that holds only the latest state and no change history. Increments are then found by comparing against the snapshot table instead of processing the full history. Of course, if the snapshot table itself is very large, the gains and losses need to be weighed carefully.
  • Add a validity end date. But this means the data in the warehouse has to be updated, which violates the principle of immutability. It requires cooperation from the warehouse's storage tool (a database, Hive, etc.): use the partitioning mechanism (usually one partition is a separate file), and delete and rebuild the partitions affected by the change. The general process is to create a temporary table, put the data that needs updating into it, delete the corresponding partition of the warehouse table, and then insert the new data back (for example with Hive's INSERT OVERWRITE). This approach is also called a "zipper table".
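The second option can be illustrated with a small sketch that rebuilds validity intervals from the change records instead of updating rows in place. This is an in-memory approximation of the idea, not Hive code; the open-ended date `9999-12-31`, the function `build_zipper`, and the field names are assumptions made for the example, and dates are ISO strings so they sort correctly as text.

```python
from itertools import groupby
from operator import itemgetter

OPEN_END = "9999-12-31"  # assumed marker for "still valid"


def build_zipper(change_records):
    """Rebuild [start, end) validity intervals from change records."""
    ordered = sorted(change_records, key=itemgetter("key", "ts"))
    rows = []
    for key, group in groupby(ordered, key=itemgetter("key")):
        versions = list(group)
        for cur, nxt in zip(versions, versions[1:] + [None]):
            if cur["status"] == "delete":
                continue  # a delete only closes the previous version
            end = nxt["ts"] if nxt else OPEN_END
            rows.append(
                {"key": key, "content": cur["content"], "start": cur["ts"], "end": end}
            )
    return rows


records = [
    {"key": "A", "content": "v1", "ts": "2020-01-01", "status": "insert"},
    {"key": "A", "content": "v2", "ts": "2020-01-05", "status": "update"},
    {"key": "B", "content": "x", "ts": "2020-01-02", "status": "insert"},
    {"key": "B", "content": "x", "ts": "2020-01-07", "status": "delete"},
]
zipper = build_zipper(records)
current = [r for r in zipper if r["end"] == OPEN_END]
```

With an end date on every row, the currently valid records are found with a simple filter on the end date rather than a search for the greatest start date. The output is produced as a whole new set of rows, which corresponds to rewriting a partition rather than updating it.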

Whichever approach is used, always remember:

Do not perform update (UPDATE) operations on warehouse tables!

The trade-off between space and time is an eternal topic in software, and it applies here as well.

Full load

This applies to "classification" data such as institutions, products and product types, or certain business parameters. The volume is often small: dozens or hundreds of rows. The transactional system does not care whether this data has changed; it simply uses the current state. In the data warehouse, however, this data is often an important statistical dimension, and there are several different aggregation requirements:

  1. Aggregate data from before the change by the old classification, and data from after the change by the new classification;
  2. Aggregate current data by the previous classification;
  3. Aggregate previous data by the current classification.

The first is the conventional requirement. The latter two also come up, especially when you want to compare the difference before and after the change.

From the loading point of view, however, full-load data is relatively simple: each day, stamp the full data set with the date and load it into the warehouse. That is, there is one full copy per day, and queries select the point in time they need.
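A minimal sketch of a daily full load with point-in-time selection. The names `load_full`, `as_of`, and `snapshot_date`, and the sample branch data, are illustrative assumptions.

```python
from datetime import date


def load_full(warehouse, source_rows, snapshot_date):
    """Append a complete copy of the source table, stamped with snapshot_date."""
    warehouse.extend({**row, "snapshot_date": snapshot_date} for row in source_rows)


def as_of(warehouse, day):
    """Return the most recent full copy taken on or before `day`."""
    dates = [r["snapshot_date"] for r in warehouse if r["snapshot_date"] <= day]
    if not dates:
        return []
    latest = max(dates)
    return [r for r in warehouse if r["snapshot_date"] == latest]


branch_dim = []
load_full(branch_dim, [{"branch": "B01", "region": "North"}], date(2020, 1, 1))
load_full(branch_dim, [{"branch": "B01", "region": "East"}], date(2020, 1, 2))

old_region = as_of(branch_dim, date(2020, 1, 1))[0]["region"]
new_region = as_of(branch_dim, date(2020, 1, 2))[0]["region"]
```

Because every day's classification is kept, a query can join facts to the classification as of the fact date, or to any other chosen date, which covers the three aggregation requirements listed above.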

To be continued.

Previous: Data Warehousing in Practice (8) - De-duplication

Next: Data Warehousing in Practice (10) - Zipper table processing


Origin blog.csdn.net/cfy_fantasyxx/article/details/103891879