Shang Silicon Valley Big Data Project "Offline Data Warehouse for Online Education" Notes 001

Video: Shang Silicon Valley Big Data Project "Offline Data Warehouse for Online Education" (bilibili)

Table of contents

P003

P004 [The concept of data warehouse is explained in detail]

P018

P019

P020

P021

P022

P023

P024


P003

Time slicing: going back in time to retrieve the data as it was at an earlier point.

P004 [The concept of data warehouse is explained in detail]

Core architecture

  1. Business data: data generated when users interact with the website, such as order data, stored in MySQL.
    1. DataX: full synchronization of entire tables.
    2. Maxwell: real-time capture of incremental changes to tables.
  2. User behavior logs: records of the actions users perform on the website, such as clicks.
    1. Flume: collects the log data.
    2. HDFS: the file storage system that holds the data.
    3. Kafka: prepares for building a real-time data warehouse, in which Flink reads data from Kafka.
  3. Hive + HDFS: Hive tables are stored on HDFS, and layered calculations form the data warehouse.
    1. HDFS: only supports writing and appending data; it does not support modifying or deleting data in place.
    2. Hive: data can be modified with an update statement, but the whole file is read, modified, and then written back as an overwrite. This is inefficient, so calculation results are saved in a new table instead.
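
The difference between full and incremental collection can be sketched in a few lines of Python. This is only an illustration of the idea, not how DataX or Maxwell are implemented; the `orders` rows and the event layout (a `type` plus a `data` row, loosely modeled on change events) are hypothetical.

```python
def full_sync(source_table):
    """Full synchronization: copy every row of the source table."""
    return [dict(row) for row in source_table]


def apply_changes(snapshot, change_events):
    """Incremental synchronization: replay change events onto a snapshot keyed by id."""
    rows = {row["id"]: dict(row) for row in snapshot}
    for event in change_events:
        row = event["data"]
        if event["type"] == "delete":
            rows.pop(row["id"], None)
        else:  # "insert" and "update" both store the latest version of the row
            rows[row["id"]] = dict(row)
    return [rows[key] for key in sorted(rows)]


# Hypothetical source table and change stream.
orders = [{"id": 1, "status": "created"}, {"id": 2, "status": "created"}]
events = [
    {"type": "update", "data": {"id": 1, "status": "paid"}},
    {"type": "insert", "data": {"id": 3, "status": "created"}},
    {"type": "delete", "data": {"id": 2, "status": "created"}},
]

snapshot = full_sync(orders)
current = apply_changes(snapshot, events)
```

A full sync is simple but moves the whole table every time, while the incremental path only moves the changes and needs an initial snapshot to start from.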

Data warehouse layering

  1. ods: operational data store, the raw data layer.
  2. dwd: data warehouse detail, the detailed data layer.
  3. dws: data warehouse summary, the summary data layer.
  4. dim: dimension, the common dimension layer.
  5. ads: application data service, the data application layer.
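
As a rough sketch of how data moves through the layers, the Python below passes a few in-memory rows from ods to dwd to dws to ads. The table contents, field names, and cleaning rule are all hypothetical, and the dim layer is left out to keep the example short; in the real project each step would be a Hive table computed from the layer below.

```python
from collections import defaultdict

# ods: raw rows exactly as collected (strings, possibly dirty).
ods_order = [
    {"order_id": "1", "user_id": "u1", "amount": "100.0", "dt": "2022-02-21"},
    {"order_id": "2", "user_id": "u1", "amount": "50.5", "dt": "2022-02-21"},
    {"order_id": "3", "user_id": "u2", "amount": None, "dt": "2022-02-21"},
]


def build_dwd(ods_rows):
    """dwd: cleaned, typed detail rows (rows without an amount are dropped)."""
    return [
        {
            "order_id": int(r["order_id"]),
            "user_id": r["user_id"],
            "amount": float(r["amount"]),
            "dt": r["dt"],
        }
        for r in ods_rows
        if r["amount"] is not None
    ]


def build_dws(dwd_rows):
    """dws: summary per day and user."""
    totals = defaultdict(float)
    for r in dwd_rows:
        totals[(r["dt"], r["user_id"])] += r["amount"]
    return dict(totals)


def build_ads(dws_rows):
    """ads: the figure a report would show, here the total order amount per day."""
    daily = defaultdict(float)
    for (dt, _user_id), amount in dws_rows.items():
        daily[dt] += amount
    return dict(daily)


dwd_order = build_dwd(ods_order)
dws_order = build_dws(dwd_order)
ads_order = build_ads(dws_order)
```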

P018

Idempotence is an important concept, which means that repeated execution of the same operation will not have additional effects, and the result will be the same as the result of executing the operation once. In other words, no matter how many times an operation is performed, the final state is consistent.
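
A minimal Python sketch of the idea, assuming a daily load job that may be rerun: appending is not idempotent, while overwriting a date partition is. The function names, the partition key, and the sample row are hypothetical.

```python
def append_load(table, rows):
    """Not idempotent: every rerun adds the same rows again."""
    table.extend(rows)


def overwrite_partition(table, dt, rows):
    """Idempotent: a rerun replaces the partition for `dt` with the same rows."""
    table[dt] = list(rows)


daily_rows = [{"order_id": 1, "amount": 100.0}]

appended = []
append_load(appended, daily_rows)
append_load(appended, daily_rows)  # the rerun duplicates the data

partitioned = {}
overwrite_partition(partitioned, "2022-02-21", daily_rows)
overwrite_partition(partitioned, "2022-02-21", daily_rows)  # the rerun changes nothing
```

After two runs `appended` holds the row twice, while `partitioned` is in the same state as after one run, which is why warehouse load jobs are usually written as partition overwrites.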

P019

The following is the complete process of building a data warehouse.

P020

Transaction business process

P021

5.2.2 Clarify the data domain

In addition to horizontal layering, data warehouse model design usually also requires vertical division of data domains based on business conditions.

The purpose of dividing data domains is to make the data easier to manage and apply.

Usually it can be divided according to business process or department. This project is divided according to business process. It should be noted that a business process can only belong to one data domain.

The following are all business processes and data domain division details required for this data warehouse project.

Data domain | Business processes
----------- | ------------------
Transaction domain | Add to cart, place order, payment success
Traffic domain | Page view, app launch, action, exposure, error
User domain | Registration, login
Interaction domain | Favorite, review
Exam domain | Take exam
Learning domain | Watch video
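
The rule that a business process can only belong to one data domain is easy to check mechanically. The sketch below stores the division above as a Python dictionary (the English process names are renderings of the table entries) and inverts it, raising an error if a process shows up in two domains. The helper name `domain_of` is hypothetical.

```python
DATA_DOMAINS = {
    "transaction": ["add to cart", "place order", "payment success"],
    "traffic": ["page view", "app launch", "action", "exposure", "error"],
    "user": ["register", "log in"],
    "interaction": ["favorite", "review"],
    "exam": ["take exam"],
    "learning": ["watch video"],
}


def domain_of(domains):
    """Map each business process to its single data domain.

    Raises ValueError if a process is listed under more than one domain.
    """
    owner = {}
    for domain, processes in domains.items():
        for process in processes:
            if process in owner:
                raise ValueError(
                    f"{process!r} is in both {owner[process]!r} and {domain!r}"
                )
            owner[process] = domain
    return owner


process_domain = domain_of(DATA_DOMAINS)
```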

P022

The business bus matrix contains all the facts (business processes) and dimensions required by the dimensional model, as well as the relationship between each business process and each dimension. The rows of the matrix are business processes, the columns of the matrix are dimensions, and the intersections of the rows and columns represent the relationship between business processes and dimensions.
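
One way to picture the matrix is as a mapping from each business process (row) to the set of dimensions (columns) it relates to. The Python sketch below uses a hypothetical three-process, three-dimension matrix; the dimension names and the marked intersections are illustrative assumptions, not the project's actual matrix.

```python
# Hypothetical dimensions (columns) and business processes (rows).
DIMENSIONS = ["date", "user", "course"]

BUS_MATRIX = {
    "place order": {"date", "user", "course"},
    "payment success": {"date", "user", "course"},
    "log in": {"date", "user"},
}


def render(matrix, dimensions):
    """Return one text line per business process, marking related dimensions with 'x'."""
    lines = []
    for process, related in matrix.items():
        marks = " ".join("x" if dim in related else "-" for dim in dimensions)
        lines.append(f"{process}: {marks}")
    return lines


def processes_for(matrix, dimension):
    """List the business processes that relate to the given dimension."""
    return sorted(p for p, related in matrix.items() if dimension in related)


matrix_lines = render(BUS_MATRIX, DIMENSIONS)
course_processes = processes_for(BUS_MATRIX, "course")
```

Reading a row tells you which dimensions a fact table for that process needs; reading a column tells you which fact tables share a dimension.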

P023

Follow the design process of the transactional fact table: select the business process → declare the granularity → confirm the dimensions → confirm the facts. The final business bus matrix obtained is shown in the following table.

P024

5.2.4 Clarify statistical indicators

(1) Atomic indicators

Atomic indicators are based on the measurement value of a certain business process and cannot be broken down further in the business definition. The core function of an atomic indicator is to define the aggregation logic of the indicator. An atomic indicator therefore contains three elements: business process, measurement value, and aggregation logic.

For example, the total order amount is a typical atomic indicator, in which the business process is the user placing an order, the measurement value is the order amount, and the aggregation logic is sum(). It should be noted that atomic indicators are only used to assist in defining the concept of indicators, and usually do not correspond to actual statistical requirements.
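
The three elements can be written down directly as a small Python data structure. This is a sketch only: the class name `AtomicIndicator`, the field names, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AtomicIndicator:
    business_process: str                           # e.g. the user placing an order
    measure: str                                    # e.g. the order amount
    aggregate: Callable[[Iterable[float]], float]   # e.g. sum

    def compute(self, rows):
        """Aggregate the measure over the rows that belong to the business process."""
        return self.aggregate(
            r[self.measure] for r in rows if r["process"] == self.business_process
        )


total_order_amount = AtomicIndicator("place order", "order_amount", sum)

# Hypothetical event rows.
rows = [
    {"process": "place order", "order_amount": 100.0},
    {"process": "place order", "order_amount": 50.0},
    {"process": "payment success", "payment_amount": 100.0},
]
total = total_order_amount.compute(rows)
```

Note that the result covers all time and all orders, which is why an atomic indicator on its own rarely matches a real reporting requirement.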

(2) Derived indicators

Derived indicators are based on atomic indicators, and their relationship with atomic indicators is shown in the figure below.

(3) Derivative indicators

Derivative indicators are composed from one or more derived indicators through various logical operations, for example ratios and proportions. Derivative indicators also correspond to actual statistical requirements.
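
The Python sketch below illustrates the chain from atomic to derived to compound indicator. It assumes that a derived indicator narrows an atomic indicator with a statistical period, an optional business qualifier, and a grouping granularity; that breakdown is an assumption made for the example, and the sample orders, field names, and function names are hypothetical.

```python
from datetime import date

# Hypothetical order rows.
orders = [
    {"dt": date(2022, 2, 21), "course": "java", "amount": 100.0, "paid": True},
    {"dt": date(2022, 2, 21), "course": "java", "amount": 50.0, "paid": False},
    {"dt": date(2022, 2, 21), "course": "python", "amount": 80.0, "paid": True},
    {"dt": date(2022, 2, 20), "course": "java", "amount": 70.0, "paid": True},
]


def derived_amount(rows, day, qualifier=lambda r: True):
    """Derived indicator: sum(amount) for one day, filtered by a qualifier, per course."""
    result = {}
    for r in rows:
        if r["dt"] == day and qualifier(r):
            result[r["course"]] = result.get(r["course"], 0.0) + r["amount"]
    return result


def paid_ratio(rows, day):
    """Compound indicator: paid order amount divided by total order amount, per course."""
    ordered = derived_amount(rows, day)
    paid = derived_amount(rows, day, qualifier=lambda r: r["paid"])
    return {
        course: paid.get(course, 0.0) / total
        for course, total in ordered.items()
        if total
    }


day = date(2022, 2, 21)
ordered_by_course = derived_amount(orders, day)
ratio_by_course = paid_ratio(orders, day)
```

Here `ordered_by_course` is the kind of figure a report would request directly, and `ratio_by_course` is built purely from two such figures without touching the atomic definition again.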

Online education offline data warehouse indicator system

Origin blog.csdn.net/weixin_44949135/article/details/132270043