Data Warehouse Theory (Part 1)

Reprinted from https://www.cnblogs.com/frankdeng/p/9462754.html

I. Data warehouse concepts

1 What is a data warehouse

A data warehouse (abbreviated DW or DWH) is a strategic collection of data that supports decision-making at every level of an enterprise. It is created for analytical reporting and decision support. For enterprises that need business intelligence, it provides guidance for improving business processes and for monitoring and controlling time, cost, and quality.

2 What can a data warehouse do?

1) Setting annual sales targets: decisions should be based on reports of past history, not on guesswork.

2) Optimizing business processes

For example, a complete order on an e-commerce site goes through browsing, ordering, payment, and logistics. In the logistics step the site may work with courier companies such as ZTO (中通), STO (申通), and Yunda (韵达). Each courier records a delivery confirmation time for every order it delivers, so the time from order to delivery can be compared across couriers to see which are fast and efficient. That tells the site which couriers to keep working with and which to drop, which improves the user experience.
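The courier comparison described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the order records, the courier labels, and the function name `average_delivery_hours` are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical order records: (courier, order time, delivery confirmation time).
ORDERS = [
    ("Courier A", "2019-01-01 09:00", "2019-01-03 09:00"),
    ("Courier A", "2019-01-02 10:00", "2019-01-03 10:00"),
    ("Courier B", "2019-01-01 08:00", "2019-01-04 08:00"),
    ("Courier C", "2019-01-01 12:00", "2019-01-03 12:00"),
]

def average_delivery_hours(orders):
    """Return the mean order-to-delivery time in hours for each courier."""
    totals = defaultdict(lambda: [0.0, 0])  # courier -> [sum of hours, order count]
    fmt = "%Y-%m-%d %H:%M"
    for courier, ordered_at, delivered_at in orders:
        elapsed = datetime.strptime(delivered_at, fmt) - datetime.strptime(ordered_at, fmt)
        totals[courier][0] += elapsed.total_seconds() / 3600
        totals[courier][1] += 1
    return {courier: hours / count for courier, (hours, count) in totals.items()}

averages = average_delivery_hours(ORDERS)
# Couriers sorted from fastest to slowest average delivery.
ranking = sorted(averages, key=averages.get)
print(ranking, averages)
```

In a real warehouse the same aggregation would normally be a query over historical order data rather than an in-memory list.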

3 Data Warehouse Features

1) A data warehouse is subject-oriented

Traditional databases organize data around applications; a data warehouse organizes data around subjects. What is a subject? A subject is an abstract concept: an abstraction used to consolidate, classify, and analyze the data in enterprise information systems at a higher level. Logically, it corresponds to an object of analysis in some macro-level analytical domain of the enterprise. Subject-oriented data organization gives a complete, consistent description of each analysis object at a higher level. It fully and uniformly characterizes the enterprise data that each analysis object involves and the relationships among that data. "Higher level" is relative to application-oriented data organization: organizing data by subject provides a higher level of data abstraction.

2) A data warehouse is integrated

The data in a data warehouse is extracted from the original, scattered databases. Operational data and DSS (decision support system) analytical data differ greatly. First, the source data for each subject is scattered across the original databases, with much duplication and many inconsistencies, and data from different online systems is tied to different application logic. Second, the consolidated data in the warehouse cannot be obtained directly from the original database systems. Data therefore has to be unified and consolidated before it enters the warehouse. This is the most critical and most complex step in building a data warehouse, and it involves the following work:

(1) Resolve all inconsistencies in the source data, such as fields with the same name but different meanings, fields with different names but the same meaning, inconsistent units, and inconsistent field lengths.

(2) Aggregate and compute data. Aggregation can be done while extracting data from the original databases, but much of it is done inside the data warehouse, that is, after the data has entered the warehouse.
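As a rough sketch of the two integration tasks above, the snippet below renames differently named fields to one warehouse name, converts units, and then aggregates. The two source systems, their field names, and the helper `unify` are assumptions made up for the illustration.

```python
# Hypothetical rows from two source systems that describe the same kind of order.
# System A stores weight in grams under "wt"; system B stores kilograms under "weight_kg".
SOURCE_A = [{"order_no": "A-1", "wt": 1500, "cust": " Alice "}]
SOURCE_B = [{"order_id": "B-7", "weight_kg": 2.5, "customer_name": "Bob"}]

# Field mapping: source field name -> unified warehouse field name.
FIELD_MAP_A = {"order_no": "order_id", "wt": "weight_kg", "cust": "customer"}
FIELD_MAP_B = {"order_id": "order_id", "weight_kg": "weight_kg", "customer_name": "customer"}

def unify(row, field_map, weight_divisor=1.0):
    """Rename fields to the warehouse names and convert weight to kilograms."""
    unified = {field_map[name]: value for name, value in row.items()}
    unified["weight_kg"] = unified["weight_kg"] / weight_divisor
    unified["customer"] = unified["customer"].strip().lower()
    return unified

# Step (1): resolve naming and unit differences between the sources.
integrated = [unify(r, FIELD_MAP_A, weight_divisor=1000) for r in SOURCE_A]
integrated += [unify(r, FIELD_MAP_B) for r in SOURCE_B]

# Step (2): aggregate, here the total shipped weight across both sources.
total_weight_kg = sum(row["weight_kg"] for row in integrated)
print(integrated, total_weight_kg)
```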

3) A data warehouse is non-updatable

Data warehouse data is mainly used for enterprise decision analysis. The main operation is querying, and data is generally not modified. The warehouse holds historical data covering a fairly long period: it is a collection of database snapshots taken at different points in time, together with derived data produced by aggregating and reorganizing those snapshots, not online transactional data. Data from online processing is integrated and then loaded into the warehouse. Once data has exceeded the warehouse's retention period, it is deleted from the current warehouse.

Because the warehouse is essentially query-only, a data warehouse management system is much simpler than a database management system. Many of the hard problems of a DBMS, such as integrity protection and concurrency control, are largely avoided. However, warehouse queries tend to touch large volumes of data, which places higher demands on query performance and calls for sophisticated indexing techniques. And because the warehouse serves the senior management of the enterprise, they expect friendly query interfaces and clear data presentation.

4) A data warehouse changes over time

Data warehouse data is non-updatable from the application's point of view: users do not update data while analyzing it. That does not mean all data in the warehouse stays the same throughout its life cycle, from initial integration and loading to eventual deletion.

Changing over time is the fourth characteristic of data warehouse data. It shows up in three ways:

(1) The warehouse keeps adding new data over time. The warehouse system must continually capture changes in the OLTP databases and append them to the warehouse. In other words, it keeps generating snapshots of the OLTP databases and adds them to the warehouse after unification and integration. Existing snapshots are never changed: when new changes are captured, a new snapshot is generated and added, and the original snapshots are left unmodified.

(2) The warehouse keeps deleting old data over time. Warehouse data also has a retention period, and expired data is deleted once the period has passed. The retention period is simply much longer than in an operational environment. Operational environments typically keep only 60 to 90 days of data, while a warehouse must keep data for much longer (e.g., 5 to 10 years) to meet DSS requirements for trend analysis.

(3) The warehouse contains a large amount of aggregated data, much of it time-related: data is often aggregated by time period or sampled at regular time slices. This data has to be re-aggregated as time passes. For this reason, data in a data warehouse always carries a time element that indicates the historical period it belongs to.
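The snapshot behaviour described in points (1) and (2) can be illustrated with a toy append-only store. This is a minimal sketch, not how any particular warehouse product works; the class name `SnapshotStore`, its methods, and the five-year retention value are assumptions chosen for the example.

```python
from datetime import date, timedelta

class SnapshotStore:
    """Toy append-only store: snapshots are added and expired, never modified."""

    def __init__(self, retention_days):
        self.retention = timedelta(days=retention_days)
        self.snapshots = []  # list of (snapshot date, frozen copy of the rows)

    def add_snapshot(self, taken_on, rows):
        # Store an immutable copy so later changes in the source cannot alter history.
        self.snapshots.append((taken_on, tuple(rows)))

    def purge_expired(self, today):
        # Drop snapshots older than the retention period; return how many were removed.
        before = len(self.snapshots)
        self.snapshots = [(d, r) for d, r in self.snapshots if today - d <= self.retention]
        return before - len(self.snapshots)

store = SnapshotStore(retention_days=365 * 5)
store.add_snapshot(date(2012, 1, 1), [("sku-1", 10)])
store.add_snapshot(date(2018, 1, 1), [("sku-1", 7)])  # changed stock level, new snapshot
removed = store.purge_expired(today=date(2019, 1, 1))
print(removed, [d.isoformat() for d, _ in store.snapshots])
```

Each stored snapshot carries its date, which is the time element mentioned in point (3).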

II. Development of the data warehouse

Data warehouse development has generally gone through three stages:

1 Simple reporting stage

At this stage, the main goal of the system is to produce the routine reports that business staff need, along with some simple summary data to help leaders make decisions. Systems at this stage mostly take the form of a database plus front-end reporting tools.

2 Data mart stage

At this stage, data is collected and organized to a certain degree according to the needs of one business unit, and presented in multi-dimensional reports as business staff require. It provides data to guide that unit's operations and data to support its leadership's decisions.

3 Data warehouse stage

At this stage, data from across the enterprise is collected and organized according to a data model. Reports can be provided to each business unit as needed, with data that is consistent across departments. The warehouse provides data to guide the business and comprehensive data to support leadership decisions.

Looking at these stages, an important difference between building a data mart and building a data warehouse is whether a data model supports it. Building the data model is therefore of decisive importance for building a data warehouse.

III. Differences between a database and a data warehouse

Before looking at the difference between a database and a data warehouse, first master three concepts: database software, database, and data warehouse.

1 Database software

It is software that can be seen and operated, and it implements the logical functions of a database. It belongs to the physical layer.

2 Database

It is a logical concept: a repository for storing data, implemented by database software. A database consists of many tables. A table is two-dimensional and can have many fields. The fields are laid out as columns, and data is written to the table row by row. Two-dimensional tables can also express multi-dimensional relationships. The popular databases on the market today are two-dimensional (relational) databases, such as Oracle, DB2, MySQL, Sybase, and MS SQL Server.

3 Data warehouse

It is an upgraded version of the database concept. Logically there is no difference between a database and a data warehouse: both are places where data is stored by means of database software. In terms of data volume, however, a data warehouse is far larger than a database. A data warehouse is mainly used for data mining and data analysis, to help leaders make decisions.

In an IT architecture, a database is a must: the data has to be stored somewhere. Take online shopping sites such as Taobao and Jingdong. Item stock levels, prices, and user account balances are all stored in back-end databases. Or, simplest of all, consider the user names and passwords of accounts such as Weibo and QQ. The back-end database must contain a user table with at least two fields, user name and password, holding our data row by row. When we log in, the user name and password we enter are sent to the back end and compared with the data in the table. If they match, we are logged in; if not, we get an error saying the password is wrong or the user does not exist. That is a database: it works in the production environment. Anything tied to a business application uses a database.

A data warehouse is a technology within BI (business intelligence). Because a database is tied to a business application, a single database cannot hold all of a company's data. Database tables are usually designed for one particular application. For the login function just described, the user table has only those two fields and nothing else. The table fits the application and works fine, but it does not support analysis. For example: in which time period do the most users log in? Which user bought the most over a year? Answering questions like these requires redesigning the table structure. The data warehouse concept was introduced for data analysis and data mining. Warehouse table structures are designed around the analysis requirements, the analysis dimensions, and the analysis metrics.
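To make the contrast concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` table mirrors the two-field login table from the text; the `login_events` table and its `login_hour` column are hypothetical additions showing the kind of structure an analysis question needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Application table: enough for login, but it records nothing about when logins happen.
conn.execute("CREATE TABLE users (user_name TEXT PRIMARY KEY, password TEXT)")
# Analysis-oriented table (hypothetical): one row per login event, with a time dimension.
conn.execute("CREATE TABLE login_events (user_name TEXT, login_hour INTEGER)")
conn.executemany(
    "INSERT INTO login_events VALUES (?, ?)",
    [("alice", 9), ("bob", 9), ("alice", 20), ("carol", 9), ("bob", 20)],
)

# "In which time period do the most users log in?"
busiest_hour, logins = conn.execute(
    "SELECT login_hour, COUNT(*) AS n FROM login_events "
    "GROUP BY login_hour ORDER BY n DESC LIMIT 1"
).fetchone()
print(busiest_hour, logins)
```

The question cannot be answered from `users` alone, because that table has no time information at all.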

The difference between a database and a data warehouse is really the difference between OLTP and OLAP.

Operational processing, called online transaction processing (OLTP, On-Line Transaction Processing) and also known as transaction-oriented processing, is the day-to-day online operation of a database for a specific business, usually querying and modifying a small number of records. Users care most about response time, data security, data integrity, and the number of concurrent users supported. Traditional database management systems, as the primary means of data management, are mainly used for operational processing.

Analytical processing, called online analytical processing (OLAP, On-Line Analytical Processing), generally analyzes historical data on certain subjects to support management decisions.

Comparison of operational processing and analytical processing

Operational processing | Analytical processing
Detailed data | Aggregated or refined data
Entity-relationship (ER) model | Star or snowflake model
Accesses current, point-in-time data | Stores historical data; does not include the most recent data
Updatable | Read-only, append-only
Operates on one unit at a time | Operates on a set at a time
High performance requirements, short response time | Relaxed performance requirements
Transaction-oriented | Analysis-oriented
Small amount of data per operation | Large amount of data per operation
Supports daily operations | Supports decision-making needs
Small data volume | Large data volume
Customer orders, inventory levels, bank account queries | Customer profitability analysis, market segmentation, etc.
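A few rows of the comparison can be shown side by side in code. The sketch below uses Python's built-in `sqlite3` module with invented tables (`inventory`, `sales_history`): the first operation is a short transaction on one record, the second is a read-only aggregate over a set of historical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("CREATE TABLE sales_history (item TEXT, year INTEGER, amount REAL)")
conn.execute("INSERT INTO inventory VALUES ('book', 5)")
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [("book", 2017, 100.0), ("book", 2018, 150.0), ("pen", 2018, 30.0)],
)
conn.commit()

# OLTP-style operation: a short transaction that updates one record.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE inventory SET stock = stock - 1 WHERE item = 'book'")
stock = conn.execute("SELECT stock FROM inventory WHERE item = 'book'").fetchone()[0]

# OLAP-style operation: a read-only aggregate over a set of historical rows.
sales_by_year = conn.execute(
    "SELECT year, SUM(amount) FROM sales_history GROUP BY year ORDER BY year"
).fetchall()
print(stock, sales_by_year)
```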

IV. Layered data warehouse architecture

1 Data warehouse layers

A data warehouse can be divided into four standard layers: ODS (temporary storage layer), PDW (data warehouse layer), DM (data mart layer), and APP (application layer).

1) ODS layer:

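This is the temporary storage layer, a staging area for interface data that prepares for the next step of data processing. In general, ODS data has the same structure as the source system data; the main purpose is to simplify later processing. In terms of granularity, ODS data is the finest-grained. ODS tables usually fall into two types: one stores the data currently waiting to be loaded, and the other stores historical data that has already been processed. Historical data is generally purged after 3 to 6 months to save space. Projects differ, though: if the source system's data volume is small, the data can be kept longer, or even kept in full.

2) PDW layer:

This is the data warehouse layer. PDW data should be consistent, accurate, and clean, that is, source system data after cleansing (with impurities removed). Data in this layer generally follows third normal form, and its granularity is usually the same as that of the ODS. The PDW layer keeps all historical data in the BI system, for example 10 years of data.

3) DM layer:

This is the data mart layer. Data here is organized by subject, usually in a star or snowflake structure. In terms of granularity, it is lightly aggregated; detail data no longer exists at this layer. In terms of time span, it is usually a part of the PDW layer: the main purpose is to meet users' analysis needs, and users typically only need to analyze recent years (for example, the last three years). In terms of breadth, it still covers all business data.

4) APP layer:

This is the application layer. Data here is built entirely to meet specific analysis needs and is also in a star or snowflake structure. In terms of granularity, it is highly aggregated. In terms of breadth, it does not necessarily cover all business data; it is a proper subset of the DM layer data and, in a sense, a duplicate of it. In the extreme case, a model can be built in the APP layer for every single report, trading space for time.

The standard layering is only a recommendation. In practice, the layers should be decided according to the actual situation, and different types of data may use different layering approaches.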
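The flow through the four layers can be sketched as a chain of small Python functions. This is a toy illustration under assumed data: the order rows, the cleansing rules, and the function names `to_pdw`, `to_dm`, and `to_app` are invented for the example and do not represent a specific product or the author's implementation.

```python
from collections import defaultdict

# ODS: raw rows with the same structure as the (hypothetical) source system.
ods = [
    {"order_id": "1", "region": " East ", "month": "2019-01", "amount": "100"},
    {"order_id": "2", "region": "east", "month": "2019-01", "amount": "50"},
    {"order_id": "3", "region": "West", "month": "2019-02", "amount": "70"},
    {"order_id": "3", "region": "West", "month": "2019-02", "amount": "70"},  # duplicate
]

def to_pdw(rows):
    """PDW: cleaned, de-duplicated detail rows at the same granularity as the ODS."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({"order_id": row["order_id"],
                      "region": row["region"].strip().lower(),
                      "month": row["month"],
                      "amount": float(row["amount"])})
    return clean

def to_dm(rows):
    """DM: lightly aggregated by subject, here sales per (region, month)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["region"], row["month"])] += row["amount"]
    return dict(totals)

def to_app(dm_rows):
    """APP: highly aggregated data for one specific report, here sales per region."""
    report = defaultdict(float)
    for (region, _month), amount in dm_rows.items():
        report[region] += amount
    return dict(report)

pdw = to_pdw(ods)
dm = to_dm(pdw)
app = to_app(dm)
print(len(pdw), dm, app)
```

Because each layer is a separate step, a change in the source format would only require adjusting `to_pdw`, which is the maintainability argument for layering.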

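2 Why layer a data warehouse?

1) Trading space for time: heavy preprocessing improves the user experience (efficiency) of application systems, so a data warehouse holds a lot of redundant data.

2) Without layers, a change in the business rules of a source system would affect the entire data cleansing process, which means a huge amount of work.

3) Layered data management simplifies data cleansing. Work that used to be done in one step is split across several steps, which turns one complex job into several simple ones and one big black box into a white box. The processing logic of each layer is relatively simple and easy to understand, so it is easier to guarantee that each step is correct. When data goes wrong, we often only need to adjust one step locally.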

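V. Introduction to metadata

When you need to find out about the businesses in an area and the services they provide, the value of the telephone Yellow Pages becomes clear. Metadata is similar to such Yellow Pages.

1 Definition of metadata

Data warehouse metadata is data about the data in the data warehouse. Its role is similar to that of the data dictionary in a database management system: it stores information such as logical data structures, files, addresses, and indexes. Broadly speaking, metadata in a data warehouse is the data that describes the structure of the warehouse's data and how it was built.

Metadata is an important part of a data warehouse management system. The metadata manager is a key component of an enterprise data warehouse. It runs through the whole process of building the warehouse and directly affects how the warehouse is built, used, and maintained.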

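(1) One of the main steps in building a data warehouse is ETL. Metadata plays an important role here: it defines the mapping from source data systems to the warehouse, the data transformation rules, the logical structure of the warehouse, the data update rules, the data load history, the load cycle, and related content. Data extraction and transformation specialists and warehouse administrators rely on metadata to build the warehouse efficiently.

(2) When using the data warehouse, users access data through metadata, to clarify the meaning of data items and to customize reports.

(3) Given the scale and complexity of a data warehouse, proper metadata management is indispensable. It includes adding or removing external data sources, changing data cleansing methods, controlling erroneous queries, and scheduling backups.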
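Point (1) above, metadata defining the source-to-warehouse mapping and the transformation rules, can be sketched as a small metadata-driven ETL step. The mapping entries, field names, and the function `run_etl` are hypothetical and serve only to show the idea of keeping the rules as data rather than hard-coding them.

```python
# Hypothetical technical metadata: for each target column, the source field and transform rule.
MAPPING_METADATA = [
    {"target": "customer_id", "source": "cust_no", "transform": int},
    {"target": "customer_name", "source": "name", "transform": str.title},
    {"target": "balance_yuan", "source": "balance_fen", "transform": lambda fen: fen / 100},
]

load_history = []  # metadata also records what was loaded and when

def run_etl(source_rows, mapping, batch_label):
    """Apply the mapping metadata to every source row and log the load."""
    loaded = [
        {rule["target"]: rule["transform"](row[rule["source"]]) for rule in mapping}
        for row in source_rows
    ]
    load_history.append({"batch": batch_label, "rows": len(loaded)})
    return loaded

source = [{"cust_no": "42", "name": "zhang wei", "balance_fen": 12550}]
warehouse_rows = run_etl(source, MAPPING_METADATA, batch_label="2019-01-01")
print(warehouse_rows, load_history)
```

Changing a cleansing rule or adding a source field then means editing the metadata, not the loading code.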

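Metadata can be divided into technical metadata and business metadata. Technical metadata is used by the IT staff who develop and manage the data warehouse. It describes data related to warehouse development, management, and maintenance, including data source information, data transformation descriptions, the warehouse model, data cleansing and update rules, data mappings, and access permissions. Business metadata serves management and business analysts. It describes data from a business perspective, including business terms, what data is in the warehouse, where the data is located, and data availability, and it helps business staff understand which data in the warehouse is available and how to use it.

As shown above, metadata not only defines the schema, sources, and extraction and transformation rules of warehouse data; it is also the foundation on which the whole warehouse system runs. Metadata links the loosely coupled components of the warehouse system into an organic whole, as shown in the figure.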

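2 How metadata is stored

There are two common ways to store metadata. One is dataset-based: each dataset has a corresponding metadata file, and each metadata file contains the metadata of its dataset. The other is database-based, that is, a metadata repository. There, the metadata file consists of a number of fields, each representing one metadata element, and each record holds the metadata of one dataset.

Each approach has advantages and disadvantages. The advantage of the first is that when data is retrieved, the corresponding metadata is transferred along with it as a separate file, so it is fairly independent of any database. Metadata can be searched using database functionality, and the metadata files can also be loaded into other database systems. The disadvantage is that if every dataset has its own metadata file, a very large database will have a huge number of metadata files, which is inconvenient to manage.

With the second approach, the metadata repository has only one metadata file, which is easier to manage: to add or remove a dataset, you simply add or delete the corresponding record in that file. When the metadata of a dataset is fetched, what is actually returned is one record of a relational table, so the user's system must be able to accept data in that specific form. On balance, the metadata repository approach is recommended.

A metadata repository is used to store metadata, so it is best built on a mainstream relational database management system. The repository also includes mechanisms for manipulating and querying metadata. The main benefit of building a metadata repository is that it provides unified data structures and business rules, which makes it easy to integrate the multiple data marts within an enterprise. Currently, some enterprises prefer to build multiple data marts rather than one centralized data warehouse. In that case, consider building a metadata repository that describes the data and supports application integration before building the data warehouse (or data marts). This provides early support for the warehouse implementation and greatly helps later development and maintenance. The metadata repository ensures the consistency and accuracy of warehouse data and provides a basis for enterprise data quality management.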
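A metadata repository of the second kind can be sketched with a single relational table, one record per dataset. The example below uses Python's built-in `sqlite3` module; the table name `dataset_metadata`, its columns, and the helper functions are assumptions made for illustration.

```python
import sqlite3

repo = sqlite3.connect(":memory:")
# One relational table holds the metadata of every dataset: one record per dataset.
repo.execute(
    "CREATE TABLE dataset_metadata ("
    "dataset TEXT PRIMARY KEY, source_system TEXT, owner TEXT, refresh_cycle TEXT)"
)

def register(dataset, source_system, owner, refresh_cycle):
    """Adding a dataset means adding one record."""
    repo.execute("INSERT INTO dataset_metadata VALUES (?, ?, ?, ?)",
                 (dataset, source_system, owner, refresh_cycle))

def unregister(dataset):
    """Removing a dataset means deleting its record."""
    repo.execute("DELETE FROM dataset_metadata WHERE dataset = ?", (dataset,))

def describe(dataset):
    """Return the metadata record of one dataset, or None if it is not registered."""
    return repo.execute(
        "SELECT source_system, owner, refresh_cycle FROM dataset_metadata WHERE dataset = ?",
        (dataset,),
    ).fetchone()

register("sales_orders", "order_system", "finance", "daily")
register("web_logs", "web_servers", "marketing", "hourly")
unregister("web_logs")
print(describe("sales_orders"), describe("web_logs"))
```

As the text notes, the caller receives a row of a relational table, so it has to be able to work with that form.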

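3 The role of metadata

In a data warehouse, the main roles of metadata are as follows.

(1) Describe which data is in the warehouse, helping decision analysts locate warehouse content.

(2) Define how data enters the warehouse, serving as a guide for data aggregation, mapping, and cleansing.

(3) Record the schedule of the data extraction jobs that run as business events occur.

(4) Record and check the requirements for system data consistency and how they are being met.

(5) Assess data quality.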

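VI. Star schema and snowflake schema

In business intelligence solutions for multidimensional analysis, the common models can be divided into the star schema and the snowflake schema, according to the relationship between fact tables and dimension tables. When designing the logical data model, you should decide whether the data will be organized as a star schema or a snowflake schema.

1 Star schema

When all dimension tables connect directly to the fact table, the whole diagram looks like a star, so the model is called a star schema.

The star schema is a denormalized structure. Every dimension of the cube connects directly to the fact table, and dimensions are not split into further levels of tables, so the data has some redundancy. For example, if the region dimension table contains two records, city C in province B of country A and city D in province B of country A, then the information for country A and province B is each stored twice, which is redundant.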

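2 Snowflake schema

When one or more dimension tables do not connect directly to the fact table but connect to it through other dimension tables, the diagram looks like several snowflakes joined together, so it is called a snowflake schema. The snowflake schema extends the star schema. It further layers the star schema's dimension tables: an original dimension table may be broken out into smaller tables, forming local "hierarchy" regions, and these decomposed tables connect to the main dimension table rather than to the fact table. As shown in the figure, the region dimension table is decomposed into country, province, and city dimension tables. Its advantage is that it improves query performance by minimizing the amount of data stored and by joining smaller dimension tables. The snowflake structure removes data redundancy.

Because of its redundancy, the star schema lets many aggregate queries run without extra joins, so it is generally more efficient than the snowflake schema. The star structure does not have to take many normalization factors into account, so design and implementation are relatively simple. Because the snowflake schema removes redundancy, some aggregates can only be produced by joining tables, so it is not necessarily as efficient as the star schema. Normalization is also a relatively complex process, so the database design, the ETL, and later maintenance are all more complex. Therefore, where redundancy is acceptable, the star schema is used more often in practice and is more efficient.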
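The region example from the text can be written out as a small runnable sketch using Python's built-in `sqlite3` module. The table and column names (`dim_region_star`, `dim_country`, `dim_province`, `dim_city`, `fact_sales`) and the sales figures are invented for the illustration; only the country A / province B / cities C and D layout comes from the text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Star: one denormalized region dimension; country and province repeat per city.
CREATE TABLE dim_region_star (region_id INTEGER PRIMARY KEY, country TEXT, province TEXT, city TEXT);
INSERT INTO dim_region_star VALUES (1, 'A', 'B', 'C'), (2, 'A', 'B', 'D');

-- Snowflake: the same dimension normalized into country, province and city tables.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_province (province_id INTEGER PRIMARY KEY, province TEXT, country_id INTEGER);
CREATE TABLE dim_city (region_id INTEGER PRIMARY KEY, city TEXT, province_id INTEGER);
INSERT INTO dim_country VALUES (1, 'A');
INSERT INTO dim_province VALUES (1, 'B', 1);
INSERT INTO dim_city VALUES (1, 'C', 1), (2, 'D', 1);

CREATE TABLE fact_sales (region_id INTEGER, amount REAL);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 5.0);
""")

# Star: one join from the fact table to the dimension.
star = db.execute(
    "SELECT d.country, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_region_star d ON d.region_id = f.region_id GROUP BY d.country"
).fetchall()

# Snowflake: three joins to reach the country level.
snowflake = db.execute(
    "SELECT c.country, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_city ci ON ci.region_id = f.region_id "
    "JOIN dim_province p ON p.province_id = ci.province_id "
    "JOIN dim_country c ON c.country_id = p.country_id GROUP BY c.country"
).fetchall()
print(star, snowflake)
```

Both queries return the same total; the star version stores 'A' and 'B' twice but needs one join, while the snowflake version stores each value once and needs three.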

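3 Star schema vs. snowflake schema

The star schema and the snowflake schema are two approaches commonly used in data warehouses. They are compared below from four perspectives.

1) Data optimization

The snowflake schema uses normalized data: data is organized inside the database so as to eliminate redundancy, which effectively reduces the data volume. Through referential integrity, the business hierarchies and dimensions are stored in the data model.

(Figure: snowflake schema)

By comparison, the star schema uses denormalized data. In a star schema, the dimensions refer directly to the fact table, and business hierarchies are not implemented through referential integrity between dimensions.

(Figure: star schema)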

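2) Business model

A primary key is a single unique key (data attribute) chosen for particular data. In the example above, Advertiser_ID would be a primary key. A foreign key (reference attribute) is simply a field in one table that matches the primary key of another dimension table. In the example, Advertiser_ID would be a foreign key of Account_dimension.

In the snowflake schema, the business hierarchy of the data model is represented by primary key-foreign key relationships between different dimension tables. In the star schema, all required dimension tables have foreign keys only in the fact table.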
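The primary key-foreign key relationship between dimension tables can be demonstrated with Python's built-in `sqlite3` module. Advertiser_ID and Account_dimension come from the text; the table name `Advertiser_dimension` and the columns `Advertiser_Name` and `Account_ID` are assumptions added so the example is complete.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
db.execute(
    "CREATE TABLE Advertiser_dimension ("
    "Advertiser_ID INTEGER PRIMARY KEY, Advertiser_Name TEXT)"
)
db.execute(
    "CREATE TABLE Account_dimension ("
    "Account_ID INTEGER PRIMARY KEY, "
    "Advertiser_ID INTEGER NOT NULL REFERENCES Advertiser_dimension(Advertiser_ID))"
)
db.execute("INSERT INTO Advertiser_dimension VALUES (1, 'Example advertiser')")
db.execute("INSERT INTO Account_dimension VALUES (100, 1)")  # valid: advertiser 1 exists

try:
    db.execute("INSERT INTO Account_dimension VALUES (101, 999)")  # no such advertiser
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The rejected insert is what referential integrity between dimension tables means in practice: an account cannot point at an advertiser that does not exist.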

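3) Performance

The third difference is performance. The snowflake schema has many joins between dimension tables and the fact table, so its performance is lower. For example, to get the details of an Advertiser, the snowflake schema has to request a lot of information: the Advertiser Name, the ID, and the addresses in the advertiser and customer tables have to be joined together and then joined with the fact table.

The star schema needs far fewer joins. In this model, if you need the information above, you simply join the Advertiser dimension table with the fact table.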

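4) ETL

The snowflake schema loads data marts, so the ETL is more complex to design, and because of the constraints of the dependent models it cannot be parallelized.

The star schema loads dimension tables and needs no dependent models between dimensions, so the ETL is relatively simple and can be highly parallelized.

Summary

The snowflake schema makes dimension analysis easier, for example: "For a given advertiser, which customers or companies are online?" The star schema is better suited to metric analysis, for example: "What is the revenue of a given customer?"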
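One way to see why dependencies between dimension tables limit parallel loading is to compute load "waves" from a dependency graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the table names and dependency sets are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical load dependencies: table -> tables that must be loaded first.
STAR_DEPENDENCIES = {
    "dim_date": set(),
    "dim_region": set(),
    "dim_customer": set(),
    "fact_sales": {"dim_date", "dim_region", "dim_customer"},
}
SNOWFLAKE_DEPENDENCIES = {
    "dim_country": set(),
    "dim_province": {"dim_country"},
    "dim_city": {"dim_province"},
    "fact_sales": {"dim_city"},
}

def load_waves(dependencies):
    """Group tables into waves; tables in the same wave can be loaded in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

star_waves = load_waves(STAR_DEPENDENCIES)
snowflake_waves = load_waves(SNOWFLAKE_DEPENDENCIES)
print(star_waves)
print(snowflake_waves)
```

In this toy setup the star dimensions all land in the first wave, while the snowflaked region hierarchy has to be loaded one level at a time.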


Origin www.cnblogs.com/qinxiaoqin/p/12050673.html