Video address: Atguigu (Shang Silicon Valley) big data project "Offline Data Warehouse for Online Education" on bilibili
P003
Time slicing: backtrack in time to retrieve the data as it was at an earlier point.
P004 [The concept of data warehouse is explained in detail]
Core architecture
- Business data: data generated when users interact with the website, such as order delivery data; it is stored in MySQL.
- DataX: full (whole-table) collection.
- Maxwell: real-time monitoring of tables that are synchronized incrementally.
- User behavior log: the series of actions a user performs while clicking through the website.
- Flume: collects the log data.
- HDFS: the file storage system that stores the data.
- Kafka: prepares for building a real-time data warehouse later; Flink reads data from Kafka.
- Hive + HDFS: Hive sits on top of HDFS, and layer-by-layer computation forms the data warehouse.
- HDFS: only supports adding and appending data; it does not support real-time modification or deletion.
- Hive: data can be modified with an update command, but the entire file is read, modified, and then overwritten when it is written back. This is inefficient, so calculation results are saved in a new table instead.
Data warehouse layering
- ODS: Operational Data Store, the raw data layer.
- DWD: Data Warehouse Detail, the detailed data layer.
- DWS: Data Warehouse Summary, the summary data layer.
- DIM: Dimension, the common dimension layer.
- ADS: Application Data Service, the data application layer.
P018
Idempotence is an important concept, which means that repeated execution of the same operation will not have additional effects, and the result will be the same as the result of executing the operation once. In other words, no matter how many times an operation is performed, the final state is consistent.
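In a warehouse load job, idempotence is often achieved by overwriting the target partition instead of appending to it. The following is a minimal Python sketch of that contrast, assuming a hypothetical in-memory `table` keyed by partition date; the function names `load_append` and `load_overwrite` and the sample row are illustrative and do not come from the course.

```python
from collections import defaultdict


def load_append(table, dt, rows):
    """Non-idempotent load: every re-run adds the same rows again."""
    table[dt].extend(rows)


def load_overwrite(table, dt, rows):
    """Idempotent load: every re-run replaces the partition with the same rows."""
    table[dt] = list(rows)


# Made-up sample data for one daily partition.
rows = [{"order_id": 1, "amount": 99.0}]

appended = defaultdict(list)
overwritten = defaultdict(list)
for _ in range(3):  # simulate the same daily job being retried three times
    load_append(appended, "2024-01-01", rows)
    load_overwrite(overwritten, "2024-01-01", rows)

print(len(appended["2024-01-01"]))     # 3 rows: retries created duplicates
print(len(overwritten["2024-01-01"]))  # 1 row: same state as a single run
```

With the overwrite variant, a failed job can simply be re-run without first cleaning up partial results.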
P019
The following is the complete process of building a data warehouse.
P020
Transaction business process
P021
5.2.2 Clarify the data domain
In addition to horizontal layering, data warehouse model design usually also requires vertical division of data domains based on business conditions.
The significance of dividing data domains is to facilitate data management and application.
Usually it can be divided according to business process or department. This project is divided according to business process. It should be noted that a business process can only belong to one data domain.
The following are all business processes and data domain division details required for this data warehouse project.
Data domain: business processes
- Transaction domain: add to cart, place order, payment success
- Traffic domain: page view, app launch, action, exposure, error
- User domain: registration, login
- Interaction domain: favorite, review
- Exam domain: take exam
- Learning domain: watch video
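The rule that a business process can belong to only one data domain can be checked mechanically. The sketch below encodes the division listed above as a Python dictionary and inverts it, raising an error if any process appears twice. The snake_case identifiers and the helper name `domain_of_each_process` are assumptions made for illustration.

```python
# Data domains and their business processes, as listed in the notes
# (identifiers normalized to snake_case for this sketch).
DATA_DOMAINS = {
    "transaction": ["add_to_cart", "place_order", "payment_success"],
    "traffic": ["page_view", "app_launch", "action", "exposure", "error"],
    "user": ["register", "login"],
    "interaction": ["favorite", "review"],
    "exam": ["take_exam"],
    "learning": ["watch_video"],
}


def domain_of_each_process(domains):
    """Invert the mapping; raise if a process is assigned to more than one domain."""
    owner = {}
    for domain, processes in domains.items():
        for process in processes:
            if process in owner:
                raise ValueError(
                    f"{process!r} is in both {owner[process]!r} and {domain!r}"
                )
            owner[process] = domain
    return owner


owner = domain_of_each_process(DATA_DOMAINS)
print(owner["place_order"])  # transaction
```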
P022
The business bus matrix contains all the facts (business processes) and dimensions required by the dimensional model, as well as the relationship between each business process and each dimension. The rows of the matrix are business processes, the columns of the matrix are dimensions, and the intersections of the rows and columns represent the relationship between business processes and dimensions.
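As a rough illustration of the row and column structure, a bus matrix can be held as a mapping from each business process to the set of dimensions it relates to. The two processes and three dimensions below are hypothetical placeholders, not the project's actual matrix, and `render` is an illustrative helper name.

```python
# Hypothetical business processes and dimensions, for illustration only.
DIMENSIONS = ["time", "user", "course"]
BUS_MATRIX = {
    "place_order": {"time", "user", "course"},
    "login": {"time", "user"},
}


def render(matrix, dimensions):
    """Return one CSV-style line per business process; 'x' marks a related dimension."""
    lines = ["process," + ",".join(dimensions)]
    for process, dims in matrix.items():
        marks = ["x" if d in dims else "" for d in dimensions]
        lines.append(process + "," + ",".join(marks))
    return lines


for line in render(BUS_MATRIX, DIMENSIONS):
    print(line)
```

Each printed line is a row of the matrix (a business process), and each comma-separated position is a column (a dimension).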
P023
Following the design process for a transactional fact table (select the business process → declare the granularity → confirm the dimensions → confirm the facts), the final business bus matrix is shown in the following table.
P024
5.2.4 Clarify statistical indicators
(1) Atomic indicators
Atomic indicators are based on a measurement value of a particular business process and cannot be broken down any further in business terms. Their core function is to define the aggregation logic of an indicator. An atomic indicator therefore has three elements: business process, measurement value, and aggregation logic.
For example, the total order amount is a typical atomic indicator, in which the business process is the user placing an order, the measurement value is the order amount, and the aggregation logic is sum(). It should be noted that atomic indicators are only used to assist in defining the concept of indicators, and usually do not correspond to actual statistical requirements.
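The three elements can be made concrete with a small Python sketch. The class name `AtomicIndicator`, the field names, and the sample order rows are assumptions for illustration; only the example itself (order placement, order amount, sum) comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AtomicIndicator:
    business_process: str                          # e.g. the user placing an order
    measure: str                                   # field holding the measurement value
    aggregate: Callable[[Iterable[float]], float]  # aggregation logic, e.g. sum

    def compute(self, facts):
        """Apply the aggregation logic to the measure across all fact rows."""
        return self.aggregate(row[self.measure] for row in facts)


total_order_amount = AtomicIndicator("place_order", "order_amount", sum)

# Made-up order facts for illustration.
orders = [{"order_amount": 100.0}, {"order_amount": 50.5}]
print(total_order_amount.compute(orders))  # 150.5
```

Note that the definition carries no time period or filter, which is why an atomic indicator on its own usually does not answer a real statistical requirement.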
(2) Derived indicators
Derived indicators are based on atomic indicators, and their relationship with atomic indicators is shown in the figure below.
(3) Derivative indicators
Derivative indicators are composed from one or more derived indicators through various logical operations, for example ratio and proportion indicators. Derivative indicators also correspond to actual statistical requirements.
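A ratio-type indicator of the kind described in (3) can be sketched as a function of two indicator values that were already computed for the same statistical period. The indicator names and numbers below are hypothetical, and returning `None` for a zero denominator is one possible convention rather than a rule from the course.

```python
def ratio(numerator, denominator):
    """Combine two indicator values into a ratio; None when the denominator is 0."""
    if denominator == 0:
        return None
    return numerator / denominator


# Hypothetical indicator values for the same one-day statistical period.
paid_order_count_1d = 30
placed_order_count_1d = 120

payment_rate_1d = ratio(paid_order_count_1d, placed_order_count_1d)
print(payment_rate_1d)  # 0.25
```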