Jianzhi Data Warehouse Project (3)

1. Review of the last lesson

2. The theoretical knowledge of the data warehouse

3. ERP project structure

1. Review of the last lesson

  • https://blog.csdn.net/SparkOnYarn/article/details/105430370
  • Mainly covered some ERP modules (basic information maintenance, procurement, sales, retail process maintenance), warehouse thresholds, and warehouse inventory; the idea of collecting large amounts of data, mining value from it, and supporting enterprise decisions; that a big data project architecture should be designed around the upstream data types (log data vs. DB data); that for reports and real-time dashboards, the underlying storage generally keeps only multi-dimensional summary data at the smallest granularity; and that static reports and dynamic reports complement each other.
  • Offline data warehouse layering: ODS (operational data store) -> DWD (data warehouse detail) -> DWS (data warehouse service) -> ADS (application data store)

2. Scenario: adding an updatetime field to business tables

MySQL table-design conventions call for both a createtime and an updatetime column, but business tables are built to varying standards and the updatetime field is sometimes missing. When extracting from MySQL into Hive, suppose we have the following data:

MySQL                                    Hive
id  value  createtime                    id  value  createtime
1   100    2019-12-10 10:00:00           1   100    2019-12-10 10:00:00
  • This row was created on 12-10, but at 10 am on 12-11 we modify it to: 1 200 2019-12-10 10:00:00, and the problem appears:
    The extraction that runs at midnight on the 12th pulls the whole day of the 11th, filtering by createtime. A row created on December 10 but modified on December 11 does not fall into that window, so the change made on the 11th can never be extracted on the 12th.

  • The SQL statement looks like this: select * from t where createtime >= '2019-12-10 00:00:00' and createtime < '2019-12-11 00:00:00'; that is, it pulls the data between midnight on December 10 and midnight on December 11.

T+1 mode: one cut-off per day. The job on the 11th extracts only the 10th's data, and the job on the 12th extracts only the 11th's data; in essence it is just one SQL query per day.

Summary:

Adding an updatetime field does not affect the business: the database maintains it by itself, and no upstream application SQL needs to change.

The SQL is as follows:

  • alter table xxx
    add column updatetime timestamp not null default current_timestamp on update current_timestamp;

  • An intuitive picture of what happens:

id  value  createtime             updatetime
1   100    2019-12-10 00:00:00    2019-12-10 00:00:00
1   200    2019-12-10 00:00:00    2019-12-11 10:00:00
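
With updatetime in place, the daily extraction can filter on updatetime instead of createtime, so the row modified on the 11th is picked up by the job that runs on the 12th. A minimal sketch, using the same placeholder table t as above:

    -- Job run at midnight on 2019-12-12: pull every row touched during 2019-12-11,
    -- regardless of when the row was originally created.
    select *
    from t
    where updatetime >= '2019-12-11 00:00:00'
      and updatetime <  '2019-12-12 00:00:00';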

Therefore, in production the updatetime field is what extraction relies on. Options for getting it in place:
1. Test it in the QA environment for a while, then push it to production;
2. MySQL is deployed master-slave (master handles writes, slaves handle reads): request an additional MySQL instance replicating from a slave, and add the updatetime field on that instance;
3. If neither of the first two options is possible, delete the data in Hive and re-pull in full before each Sqoop extraction;
4. Use the MySQL binlog (binary log) -> canal / maxwell -> Kafka -> flume -> Hive / HDFS (near real time)
   -> Spark / Flink -> Hive / HBase (real time)

What about deleted data?
delete from t where id = 100;   -- physical deletion (think about it: if the row is physically deleted in MySQL, how do we delete it in Hive?)
update t set delflag = 1 where id = 100;   -- logical (soft) deletion
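
Downstream, one common way to handle logical deletion is simply to filter out flagged rows when building the next layer in Hive. A minimal sketch, assuming the delflag column from the example above has been synced into an ODS table called ods_t (a placeholder name):

    -- Hive sketch: exclude logically deleted rows when loading the DWD layer.
    select *
    from ods_t
    where delflag = 0 or delflag is null;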

2.1 Data modeling: star model

The essence is designing the table structure: commodity -> SKU. If the upstream tables are well standardized (header table, detail table, commodity table, supplier table, commodity category table), the downstream work becomes much easier.

Dimensional modeling covers three models:
1. Star model:
one fact table, multiple dim tables.
(There is also a non-standard variant of the star model, described below.)

2. Snowflake model:

3. Constellation model:

1. The star model is as follows:

  • depotitem document fact table: the table ruozedata_depotitem; the pattern is one fact table plus multiple dim (dimension) tables.
  • Fact table: order facts. Dim tables: a time dimension table (whose data never changes once createtime is generated), a supplier dimension table (a slowly changing dimension: the supplier's name, bank card number, and contact information may change over time), and a warehouse dimension table.
    [Figure: star model diagram]
Variant star model:

fact_depotitem document fact table: the header (main) table plus the item (detail) table are joined into fact_depotitems, i.e. a single table produced by joining the two.
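
As an illustration of how the star model is queried, here is a sketch of one fact table joined to its dimension tables; the table and column names (dim_date, dim_supplier, dim_depot, the *_key columns) are illustrative, not the actual ruozedata schema:

    -- Star-model sketch: aggregate order amounts by day, supplier and warehouse
    -- by joining the fact table to each dimension table once.
    select d.date_key,
           s.supplier_name,
           w.depot_name,
           sum(f.amount) as total_amount
    from fact_depotitem f
    join dim_date     d on f.date_key     = d.date_key
    join dim_supplier s on f.supplier_key = s.supplier_key
    join dim_depot    w on f.depot_key    = w.depot_key
    group by d.date_key, s.supplier_name, w.depot_name;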

2.2 Data modeling: snowflake model

  • It is still one fact table with multiple dim tables, and is an extension of the star model. The difference is that the dimension tables are normalized and further decomposed into additional tables; compare the ERP database tables ruozedata_material and ruozedata_materialcategory.
    [Figure: snowflake model diagram]
    The dimension table design conforms to 3NF. Normalization effectively reduces data redundancy, but query performance is relatively poor, the model is more complicated, and the degree of parallelism is low.
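
To see the extra join the snowflake model introduces, here is a sketch that reaches the category name through the normalized category table; ruozedata_material and ruozedata_materialcategory are the tables named above, but the join keys and column names are assumptions:

    -- Snowflake sketch: the product dimension is normalized, so getting the
    -- category name costs one more join through the category table.
    select c.name        as category_name,
           sum(f.amount) as total_amount
    from fact_depotitem f
    join ruozedata_material m         on f.materialid = m.id
    join ruozedata_materialcategory c on m.categoryid = c.id
    group by c.name;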

2.3 Data modeling: constellation model

  • The constellation model is an extension of the star model: it is based on multiple fact tables that share dimension tables.
    [Figure: constellation model diagram]
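
A sketch of what "multiple fact tables sharing dimensions" means in practice: two different fact tables are both joined to the same warehouse dimension. All names here (fact_inventory, dim_depot, the key columns) are illustrative:

    -- Constellation sketch: two fact tables share the same dimension table.
    -- Orders aggregated by warehouse:
    select w.depot_name, sum(o.amount) as order_amount
    from fact_depotitem o
    join dim_depot w on o.depot_key = w.depot_key
    group by w.depot_name;

    -- Inventory aggregated over the very same warehouse dimension:
    select w.depot_name, sum(i.stock_qty) as stock_qty
    from fact_inventory i
    join dim_depot w on i.depot_key = w.depot_key
    group by w.depot_name;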

The star model does not follow 3NF: it is denormalized, and the dimension hierarchy is not split out into separate tables. Data is redundant, trading storage space for fewer dimension-table joins and better performance. For example, take a category table with data like this:

Product Name    Category 1     Category 2
Mango           Fresh Fruit    Tropical Fruit
Dragon Fruit    Fresh Fruit    Tropical Fruit

The data redundancy here means that the level-1 and level-2 categories are stored repeatedly on every row; despite the redundancy, this gives the best performance.

In the normalized form, the product table stores only a category ID:

Product Name    Category ID
Mango           25
Dragon Fruit    25

The category table itself then references its parent, forming a child-parent hierarchy; rebuilding the full path is equivalent to joining the table to itself in MySQL:

Category ID    Name              Parent ID
25             Tropical Fruit    18
18             Fresh Fruit       -1
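
Rebuilding the level-1 / level-2 category path from this parent-child table requires the self-join mentioned above. A sketch, with hypothetical product and category table/column names following the tables shown:

    -- Self-join the category table to recover both category levels per product.
    select p.product_name,
           c1.name as category_1,
           c2.name as category_2
    from product  p
    join category c2 on p.category_id = c2.category_id
    join category c1 on c2.parent_id  = c1.category_id;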

The constellation model also does not conform to 3NF: the business is complex, development is complex, and performance is not high.

Dimension reduction: commodity table + commodity category table -> a single commodity table. In practice this is of little use, and the textbook theory of data warehousing is mostly of limited use in real projects.

3. ERP project structure

Store multi-dimensional summary data at the smallest granularity.
[Figure: ERP project architecture diagram]

3.1 Data warehouse layering flow chart

ODS commodity table + ODS commodity category table -> dwd_sku (two tables merged into one, i.e. dimension reduction). This is generally not done, because it does not improve performance by much.
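
For reference, the merge itself is straightforward; a minimal Hive sketch of building dwd_sku from the two ODS tables (table and column names here are assumptions, not the real ODS schema):

    -- Dimension reduction sketch: fold the commodity table and its category table
    -- into one wide dwd_sku table.
    insert overwrite table dwd.dwd_sku
    select m.id   as sku_id,
           m.name as sku_name,
           c.id   as category_id,
           c.name as category_name
    from ods.ods_material m
    left join ods.ods_materialcategory c
      on m.categoryid = c.id;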

As for how to identify the commonly used fields: find out which fields of which tables the operations and QA staff use most often; the data warehouse is built for other people to use, so communicate with them in advance.

The fact tables are further divided into detail fact tables and aggregate fact tables:

[Figure: detail fact table vs. aggregate fact table]

Normally, DWS-layer data is aggregated (grouped) directly into the ADS layer, but some fields needed downstream are not metrics in the DWS layer and have to be brought in by joining with dim dimension tables.
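
A sketch of that pattern: aggregate a DWS table into ADS while joining a dim table for a descriptive field (the supplier name) that the DWS layer does not carry. All table and column names are illustrative:

    -- ADS sketch: group DWS data and join the supplier dimension for its name.
    insert overwrite table ads.ads_supplier_day
    select w.supplier_name,
           s.dt,
           sum(s.order_amount) as total_amount
    from dws.dws_order_day s
    join dim.dim_supplier w on s.supplier_id = w.supplier_id
    group by w.supplier_name, s.dt;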

MySQL can do TopN, row-to-column, column-to-row, and window functions; all of these can be done in MySQL, so why introduce a big data framework at all?

Because the upstream query either cannot get through at all or cannot be computed on a single node. In the ERP MySQL it would be one long string of SQL joins and group-bys;
in Hive the same join -> group by runs distributed.

Or the question raised in the last lesson: why not compute results directly on the ODS layer and write them straight to the ADS layer?

  • More layers are not always better. A reasonable layer design, balancing computation cost against labor cost, is what makes a data warehouse architecture perform well:

For example, if the ODS layer receives 200-300 GB every 5 minutes, simply moving that data into the DWD layer can already overwhelm the cluster. ODS -> DWD -> DWS means three copies of the data, and multiplied by 3x block replication that is effectively nine copies. When the data volume is that large, extra layering not only wastes space, the computation also cannot keep up. (The live-streaming data of Kuaishou, Chedi, or a provincial satellite TV station really is that large.)

For example, 1 TB in the ODS layer, flowing through the DWD layer into the DWS layer, becomes 9 TB in total, because with 3x replication each layer stores 3 copies.

So we need to estimate how many GB of data exist upstream in total, and how much it grows in a year.
How to estimate: take the table with the most fields, count its 2019 rows using the createtime field, measure the size of one row, and multiply row count by row size. For example, if one table is estimated at 1 GB and there are 100 tables, multiply by the replication factor: 1 GB x 100 x 3 = 300 GB.
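
A rough way to get those numbers directly from MySQL; some_big_table and the 'erp' schema name below are placeholders:

    -- Row count for 2019, using the createtime field.
    select count(*) as rows_2019
    from some_big_table
    where createtime >= '2019-01-01 00:00:00'
      and createtime <  '2020-01-01 00:00:00';

    -- Approximate average row size and total table size from information_schema.
    select avg_row_length, data_length
    from information_schema.tables
    where table_schema = 'erp'
      and table_name   = 'some_big_table';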
