Relational database design (double foreign key)

The design method described in this article is aimed mainly at large-scale, comprehensive data analysis systems, which take in many types of data sources whose data is unstable. Unstable here means that the external data keeps changing after it has entered the data warehouse, and that these changes affect the overall analysis. A general data warehouse applies various aggregation strategies, and aggregated data improves analysis efficiency, but updating aggregated data is extremely costly: one change sets off a chain reaction that ripples through large amounts of data. The double foreign key design is meant to deal with such unstable sources, in data analysis systems whose data sources are diverse and cannot be constrained by the warehouse itself.

Relational databases have primary keys and foreign keys. These are basic features of the database and are collectively referred to as relational keys. Usually a relational key represents an association in the domain model, which is its most common use. The use discussed in this article is different: it applies relational keys to one specific scenario. Because that scenario appears in many different application systems, the design method should be general and suitable for the model design of any data warehouse.

1 Application Scenario Description

We first simplify the domain model of a retail system to just three models: store, shopping guide and order, as shown in the following figure:

  1. Each table in the above figure has a logic_id, which can be a MySQL auto-increment ID or an Oracle sequence value. It uniquely identifies a domain instance, but it carries no business information and cannot describe an instance in business terms. An incoming real-world record cannot be matched to an existing instance through logic_id; matching often requires a combination of several fields. In terms of database performance, the more attributes are matched, the slower the match, and the effect becomes more noticeable as the data volume grows.
  2. Each of the three tables in the above figure also contains a code. The code is a domain concept and can express the logical relationship between instances, but its data is unstable, for two reasons. First, the store code and the shopping guide code in the ERP can change, either because the merchant corrects its own error or as a normal business change. Second, when a merchant switches ERP systems, the related codes have to change. Since codes are certain to change in practice, the code cannot simply be used as the primary key when building the data warehouse.
  3. With the table design in the above figure, and leaving performance aside, we could also join on logic_id. When a code changes, only the code itself needs to be modified, because the fact data and the dimension data are associated through the internal logic_id. This solves the problem of changing business foreign keys inside the application system. In the data warehouse, however, a changed record is treated as a new instance, and later fact data is attached to that new instance, so a single logical instance ends up as two instances in the data warehouse. Because the application system is not necessarily under the warehouse's control, the warehouse cannot sense changes in advance and can only respond to them passively.
  4. Another option is to use logic_id as the attribute that uniquely identifies an instance in the data warehouse. In practice, however, when a merchant replaces its ERP, logic_id values may repeat within the same merchant, so logic_id cannot identify an instance uniquely over time.

To sum up, the traditional foreign key design can neither guarantee the uniqueness of data in the data warehouse nor keep statistics and retrieval consistent after a business primary key changes. We need a new design that ensures data consistency while minimizing the amount of data that has to change.

2 Explanation of the concept of double foreign keys

2.1 Main Concepts

First, consider the concept of fact data. Facts are facts and never change, and all fact data is associated with stable dimension data. At the same time, the warehouse has to describe the facts truthfully and reflect changes in the data. To achieve both truthful description and stable association, this article introduces the following concepts:

  1. Logical model instance: A logical instance is different from a physical instance; it is an abstraction. When the code of a business instance changes, or the ERP is replaced, the application system stores only the final, changed instance and can ignore the historical data entirely. The data warehouse is different: it needs to keep all past data, so all changes are stored in the form of logical instances, that is, every change generates a new instance. After a store, shopping guide or other dimension record changes in the application system, the data warehouse holds multiple instances that reflect the change history of the data.
  2. Association key: The association key here differs from a business primary key in the usual sense. It is built from the set of attributes that together uniquely identify a business instance: the combination is hashed with FarmHash to produce a 64-bit integer with a very low collision rate. For example, the association key of a shopping guide is farmHash(shopCode + guiderCode). Dimension data and the corresponding fact data are joined on this key, which keeps the association stable and unaffected by changes in external data.
  3. Variable association key: The variable association key exists to respond to change. When dimension data changes in the application system, the data warehouse cannot avoid changing its own data, but the amount changed should be kept to a minimum, which is why the design of the variable association key matters. It has two functions: it associates with the external application system, mainly for retrieval and grouping, and it associates with the logical instances in the data warehouse.
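
The association key described above can be sketched as follows. This is only an illustration under stated assumptions: FarmHash is not part of the Python standard library, so an 8-byte blake2b digest stands in for it, and the function name assoc_key, the "|" separator and the sample codes are hypothetical rather than taken from the article.

```python
import hashlib


def assoc_key(*parts: str) -> int:
    """Hash the identifying attributes of a business instance to a signed 64-bit integer.

    Assumption: blake2b with an 8-byte digest stands in for FarmHash.
    Assumption: parts are joined with "|" so that ("ab", "c") and ("a", "bc")
    do not collapse into the same input string.
    """
    joined = "|".join(parts)
    digest = hashlib.blake2b(joined.encode("utf-8"), digest_size=8).digest()
    # signed=True keeps the value inside the range of a signed 64-bit column.
    return int.from_bytes(digest, byteorder="big", signed=True)


# Hypothetical codes: shopping guide G01 working in store 0002.
guide_key = assoc_key("0002", "G01")
store_key = assoc_key("0002")
print(guide_key, store_key)
```

Because the key is derived only from business attributes, any loader that sees the same shopCode and guiderCode computes the same value without looking anything up in the warehouse first.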

A textual description is not very intuitive, so let us work through the principle of the double association key with a real case:

Notes:

There are two codes in the store table shops in the above picture, namely code and assoc_code. As long as the data does not change, the two fields hold the same value. Once the code changes, the assoc_code of the old record is set to the code of the new record, so that both records share one assoc_code; this field is mainly used for data grouping and filtering.
1) The hashed_id and shop_id fields are generated by the same rule; both are farmHash(shop_id). If the primary key consists of several fields, the fields are concatenated and then hashed.
2) The association between the store table shops and the order table orders shown in the picture is a primary/foreign key relationship between hashed_id and shop_id.
3) Store 0003 at the bottom of the store table is a new record in the warehouse table, while in the application system it corresponds to store 0002.
4) The rows at the bottom of the orders table are new orders generated after the store code changed, so their shop_id is associated with the new store record.
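
The handling of a code change described in the notes can be sketched as below. The column names follow the article; everything else is an assumption made for illustration: the helper names h64, add_shop and change_code, the store names, the use of SQLite, the choice of the store code as hash input, and blake2b as a stand-in for FarmHash.

```python
import hashlib
import sqlite3


def h64(text: str) -> int:
    # Stand-in for FarmHash (assumption): 8-byte blake2b digest as a signed 64-bit int.
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shops (hashed_id INTEGER PRIMARY KEY, code TEXT, assoc_code TEXT, name TEXT)"
)


def add_shop(code: str, name: str) -> None:
    # A store entering the warehouse starts with code and assoc_code equal.
    conn.execute("INSERT INTO shops VALUES (?, ?, ?, ?)", (h64(code), code, code, name))


def change_code(old_code: str, new_code: str) -> None:
    # Keep the old record as history, but point its assoc_code at the new code.
    name = conn.execute("SELECT name FROM shops WHERE code = ?", (old_code,)).fetchone()[0]
    conn.execute("UPDATE shops SET assoc_code = ? WHERE assoc_code = ?", (new_code, old_code))
    # The changed store becomes a new record with its own hashed_id.
    add_shop(new_code, name)


add_shop("0001", "Store A")
add_shop("0002", "Store B")
change_code("0002", "0003")

rows = conn.execute("SELECT code, assoc_code FROM shops ORDER BY code").fetchall()
print(rows)  # [('0001', '0001'), ('0002', '0003'), ('0003', '0003')]
```

Only dimension rows are touched here. Updating by assoc_code rather than by code means that a later change from 0003 to another code would move every earlier record of the same store along with it.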

2.2 Query Example

Based on the table structure above, the dimension table shops holds two business codes, code and assoc_code, and the dimension table shops is joined to the fact table orders on the hashed value. Because the hash is generated from business data, it carries business meaning: even if data is lost through an operator error, the key can be regenerated from the business data, so the relationships between records are not affected. An actual query example is as follows:

SELECT t1.assoc_code, t1.name, COUNT(DISTINCT t2.no), SUM(t2.amount)
FROM shops t1
LEFT JOIN orders t2 ON t1.hashed_id = t2.shop_id
GROUP BY t1.assoc_code, t1.name

The query above groups by store and computes the order count and the sales amount. With this design, a change in the application system causes only a minimal change in the data warehouse, and changing dimension data is far easier than changing fact data. The design can also be applied to aggregation models, where historical data is aggregated by day or by month. Note, however, that the hash_id must then be designed to include the corresponding dimensions; otherwise the join produces a Cartesian product with duplicated rows, and statistics such as the shopping guide figures come out wrong.
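
To make the effect of the query concrete, the sketch below runs it against a small in-memory SQLite database. The rows are invented sample data, not figures from the article; blake2b again stands in for FarmHash, the store code is assumed to be the hash input, and the no column is quoted only as a precaution because NO is an SQLite keyword. An ORDER BY is added so the output order is fixed.

```python
import hashlib
import sqlite3


def h64(text: str) -> int:
    # Stand-in for FarmHash (assumption): 8-byte blake2b digest as a signed 64-bit int.
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shops (hashed_id INTEGER PRIMARY KEY, code TEXT, assoc_code TEXT, name TEXT);
CREATE TABLE orders ("no" TEXT PRIMARY KEY, shop_id INTEGER, amount REAL);
""")

# Hypothetical state after store 0002 changed its code to 0003:
# both records remain, and both carry assoc_code 0003.
shops = [("0001", "0001", "Store A"), ("0002", "0003", "Store B"), ("0003", "0003", "Store B")]
conn.executemany(
    "INSERT INTO shops VALUES (?, ?, ?, ?)",
    [(h64(code), code, assoc, name) for code, assoc, name in shops],
)

# Each order keeps the hash of the store code that was valid when it was placed.
orders = [("A1", "0001", 10.0), ("B1", "0002", 20.0), ("B2", "0002", 5.0), ("B3", "0003", 7.0)]
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(no, h64(code), amount) for no, code, amount in orders],
)

result = conn.execute("""
    SELECT t1.assoc_code, t1.name, COUNT(DISTINCT t2."no"), SUM(t2.amount)
    FROM shops t1
    LEFT JOIN orders t2 ON t1.hashed_id = t2.shop_id
    GROUP BY t1.assoc_code, t1.name
    ORDER BY t1.assoc_code
""").fetchall()
print(result)  # [('0001', 'Store A', 1, 10.0), ('0003', 'Store B', 3, 32.0)]
```

The orders placed under code 0002 and under code 0003 land in one group, although no row in orders was modified. Note that the grouping also includes name, so the two store records only merge if they carry the same name.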

3 Summary

The double foreign key design has two key features: 1) the same foreign key is stored in two copies, one that records history and one that responds to changes and supports external retrieval and grouping; 2) the dimension table and the fact table are joined on hash values that carry business meaning, rather than on meaningless auto-increment values. As a result, data can enter the data warehouse in any order, and the design of the acquisition system can be very flexible.

Such a model design addresses the flexibility of data warehouse data and reduces the impact that changes in external systems have on it. The effect on the data aggregation model is especially significant. Usually, data stored in a big data system cannot be modified once it has been aggregated, yet changes to dimension data are unavoidable in practice, so the model design of the data warehouse has to accommodate them. Many current big data systems handle this by recalculation or by manual work. The double foreign key design therefore improves the flexibility and availability of data systems, lowers the requirements placed on external systems, and improves development efficiency.

