E-commerce data warehouse project

This article is for reference only, forwarded from https://blog.csdn.net/a1786742005/article/details/105833521

1. The overall structure of the project

[Figure: overall project structure]

2. Data description

2.1 User behavior data

1. Startup log data
Each record is a single JSON object.
2. Event log data
Composition: timestamp, common fields, event log.
Events:
(1) Product list
(2) Product click
(3) Product details
(4) Advertisement
(5) Message notification
(6) User active in the background
(7) Comment
(8) Favorite
(9) Like
(10) Error log
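
As a minimal sketch of how such an event record might be split into its three parts: the layout below is an assumption, not something the text above specifies. It supposes each line has the form `timestamp|json`, with the common fields under a hypothetical key `cm` and the events under a hypothetical key `et`.

```python
import json


def parse_event_line(line):
    """Split one event log line into (timestamp, common_fields, events).

    Assumed layout (hypothetical): "<epoch millis>|<json object>", where the
    JSON object stores common fields under "cm" and a list of events under "et".
    """
    timestamp_text, _, payload = line.partition("|")
    record = json.loads(payload)
    return int(timestamp_text), record.get("cm", {}), record.get("et", [])


# Made-up sample line for illustration only.
sample = '1583769785914|{"cm": {"mid": "m1"}, "et": [{"en": "display", "kv": {"goodsid": "5"}}]}'
timestamp, common, events = parse_event_line(sample)
event_names = [event["en"] for event in events]
```

Each entry of `events` would correspond to one of the event types listed above.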

2.2 Business data

1. Order table
2. Order detail table
3. SKU product table
4. User table
5. Product first-level category table
6. Product second-level category table
7. Product third-level category table
8. Payment transaction table
9. Province table
10. Region table
11. Brand table
12. Order status table
13. SPU product table
14. Product review table
15. Order refund table
16. Add-to-cart table
17. Product favorites table
18. Coupon usage table
19. Coupon table
20. Activity table
21. Activity-order association table
22. Discount rule table
23. Code dictionary table
24. Activity participating-product table
25. Time table
26. Holiday table
27. Holiday year table

3. Importing data into HDFS

3.1 User behavior data

1. User behavior data generation

2. Log collection Flume configuration
(1) The source is TaildirSource.
(2) The channel is a Kafka channel, so the sink is omitted.
(3) Custom Flume interceptors.
Two custom interceptors are used: an ETL interceptor and a log type interceptor.
ETL interceptor: filters out logs with illegal timestamps or incomplete JSON data.
Log type interceptor: separates startup logs from event logs so that they can be sent to different Kafka topics.
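
The logic of the two interceptors can be sketched as follows. This is only an illustration of the filtering and routing rules, not the Flume API: a real interceptor would be a Java class implementing Flume's `Interceptor` interface. The topic names, the `timestamp|json` event layout, and the rule that a legal timestamp is 13 digits are all assumptions made for the example.

```python
import json

STARTUP_TOPIC = "topic_start"  # hypothetical topic name
EVENT_TOPIC = "topic_event"    # hypothetical topic name


def _is_json_object(text):
    """Return True if text parses as a complete JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:
        return False


def is_valid(line):
    """ETL step: keep only lines with a legal timestamp and complete JSON."""
    text = line.strip()
    if text.startswith("{"):
        # Startup log: a single JSON object.
        return _is_json_object(text)
    # Event log (assumed layout): "<13-digit epoch millis>|<json object>".
    timestamp, separator, payload = text.partition("|")
    if not separator or not (timestamp.isdigit() and len(timestamp) == 13):
        return False
    return _is_json_object(payload.strip())


def route(line):
    """Log type step: choose the Kafka topic for a line."""
    return STARTUP_TOPIC if line.lstrip().startswith("{") else EVENT_TOPIC


def intercept(lines):
    """Drop invalid lines, then tag each remaining line with its topic."""
    return [(route(line), line) for line in lines if is_valid(line)]
```

For instance, `intercept(['{"en": "start"}', 'abc|{"cm": {}}'])` keeps only the first line and tags it with the startup topic.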

3. Log consumption Flume configuration
(1) The source is KafkaSource.
(2) The channel is a file channel.
(3) The sinks write to two paths on HDFS, compressed with LZO.

3.2 Business data

1. Process
Sqoop imports the MySQL tables into HDFS, also compressed with LZO.

2. Data synchronization strategy
(1) Introduction to data synchronization strategies
The types of synchronization strategy are: full table, incremental table, and new-and-changed table.
Full table: stores the complete data.
Incremental table: stores newly added data.
New-and-changed table: stores newly added data and changed data.

(2) Full synchronization strategy
Daily full synchronization stores a complete copy of the data every day as one partition.
It suits tables that are not large, where new rows are inserted and old rows are also modified every day.
For example: code dictionary table, brand table, product third-level category table, product second-level category table, product first-level category table, discount rule table, activity table, activity participating-product table, add-to-cart table, product favorites table, coupon table, SKU product table, SPU product table.

(3) Incremental synchronization strategy
Daily incremental synchronization stores each day's newly added data as one partition.
It suits tables that are large and only receive new rows every day.
For example: order refund table, order status table, order detail table, activity-order association table, product review table.

(4) New-and-changed synchronization strategy
Daily new-and-changed synchronization stores the records whose creation time or operation time falls on the current day.
It suits tables that are large and receive both new rows and changes to existing rows.
For example: user table, order table, coupon usage table.
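
The three strategies differ only in which source rows land in a given day's partition. A minimal sketch, assuming each source row carries hypothetical `create_time` and `operate_time` columns (the actual column names are not given above):

```python
from datetime import date


def full_sync(rows, day):
    """Full: the partition for `day` holds every row of the source table."""
    return list(rows)


def incremental_sync(rows, day):
    """Incremental: the partition holds only rows created on `day`."""
    return [row for row in rows if row["create_time"] == day]


def new_and_changed_sync(rows, day):
    """New-and-changed: rows created on `day` or modified on `day`."""
    return [
        row for row in rows
        if row["create_time"] == day or row.get("operate_time") == day
    ]


# Made-up source table for illustration.
today = date(2020, 3, 10)
yesterday = date(2020, 3, 9)
source = [
    {"id": 1, "create_time": yesterday, "operate_time": None},
    {"id": 2, "create_time": yesterday, "operate_time": today},
    {"id": 3, "create_time": today, "operate_time": None},
]

full_ids = [row["id"] for row in full_sync(source, today)]
incremental_ids = [row["id"] for row in incremental_sync(source, today)]
changed_ids = [row["id"] for row in new_and_changed_sync(source, today)]
```

Row 2 is the interesting case: it was created earlier but modified today, so only the full and new-and-changed strategies pick it up.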

(5) Special strategy
Some special dimension tables do not need to follow the synchronization strategies above.
A. Objective-world dimensions
Dimensions of the objective world that do not change (such as gender, region, ethnicity, political affiliation, shoe size) only need to store a fixed set of values.
B. Date dimension
The date dimension can be imported one year or several years at a time.
C. Region dimension
Province table, region table.

3. Diagram
[Figure: business data import]

4. Data warehouse fundamentals

4.1 Why layer the data warehouse?

1. Simplify complex problems
Decompose a complex task into multiple layers. Each layer handles only a simple task, which makes problems easier to locate.

2. Reduce repeated development
Standardized data layering and intermediate-layer data greatly reduce repeated computation and increase the reuse of computed results.

3. Isolate the raw data
Whether the concern is data anomalies or data sensitivity, layering decouples the raw data from the statistical data.

4.2 Layered structure diagram

[Figure: data warehouse layered structure]

4.3 Data warehouse naming convention

1. Table names
ODS layer: ods_<table name>
DWD layer: dwd_dim_<table name> or dwd_fact_<table name>
DWS layer: dws_<table name>
DWT layer: dwt_<table name>
ADS layer: ads_<table name>
Temporary tables: xxx_tmp
User behavior tables use log as the suffix.

2. Script names
<data source>_to_<target>_db.sh or <data source>_to_<target>_log.sh
User behavior scripts use log as the suffix, and business data scripts use db as the suffix.
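
The naming rules above are mechanical enough to express as small helpers. The functions below are hypothetical illustrations, and the sample table and script names are made up to show the pattern rather than taken from the project.

```python
LAYERS = ("ods", "dwd", "dws", "dwt", "ads")


def table_name(layer, name, kind=None, is_log=False):
    """Build a table name such as dwd_fact_order_info or ods_start_log.

    `kind` is "dim" or "fact" and applies only to the DWD layer;
    `is_log` adds the log suffix used by user behavior tables.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    if kind is not None and (layer != "dwd" or kind not in ("dim", "fact")):
        raise ValueError("kind must be 'dim' or 'fact' and only applies to dwd")
    parts = [layer]
    if kind:
        parts.append(kind)
    parts.append(name)
    if is_log:
        parts.append("log")
    return "_".join(parts)


def script_name(source, target, data_kind):
    """Build a script name; data_kind is 'log' (user behavior) or 'db' (business)."""
    if data_kind not in ("db", "log"):
        raise ValueError("data_kind must be 'db' or 'log'")
    return f"{source}_to_{target}_{data_kind}.sh"


fact_table = table_name("dwd", "order_info", kind="fact")
log_table = table_name("ods", "start", is_log=True)
load_script = script_name("hdfs", "ods", "log")
```

Centralizing the convention this way makes a misnamed table fail early instead of surfacing later as a missing-table error.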

4.4 Dimensional modeling

Dimensional modeling gives rise to three models: the star model, the snowflake model, and the constellation model.
1. Star model
[Figure: star model]

2. Snowflake model
[Figure: snowflake model]

3. Constellation model
[Figure: constellation model]

4. Model selection
[Figure: model selection]

4.5 Dimension table and fact table

1. Dimension table
(1) Introduction to dimension tables
A dimension table generally holds descriptive information about facts. Each dimension table corresponds to an object or concept in the real world, for example: user, product, date, region.
(2) Characteristics of dimension tables
A. Dimension tables are wide (many attributes, many columns).
B. Compared with fact tables, they have relatively few rows, usually fewer than 100,000.
C. The content is relatively fixed, for example code tables.

2. Fact table
(1) Introduction to fact tables
Each row of a fact table represents a business event (order, payment, refund, review, etc.). The term "fact" refers to the measures of a business event (counts, number of items, amounts, etc.), for example the order amount in an order event.

(2) Contents of a fact table
Each row of a fact table contains additive numeric measures and foreign keys that link to dimension tables. There are usually two or more foreign keys, and the relationships among them express the many-to-many relationships between dimension tables.

(3) Characteristics of fact tables
A. Very large.
B. Relatively narrow, with few columns.
C. Changes frequently, with many new rows every day.
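
To make the relationship between the two kinds of table concrete, here is a minimal in-memory sketch. The table contents and column names are made up for illustration: each fact row carries foreign keys plus additive measures, and the dimension tables supply the attributes used for grouping.

```python
from collections import defaultdict

# Dimension tables keyed by primary key (hypothetical sample data).
dim_user = {
    1: {"name": "Ann", "gender": "F"},
    2: {"name": "Bob", "gender": "M"},
}
dim_sku = {
    10: {"sku_name": "phone", "category": "electronics"},
    11: {"sku_name": "novel", "category": "books"},
}

# Fact table: each row is one business event with foreign keys and measures.
fact_order = [
    {"user_id": 1, "sku_id": 10, "sku_num": 1, "amount": 500.0},
    {"user_id": 2, "sku_id": 11, "sku_num": 2, "amount": 30.0},
    {"user_id": 1, "sku_id": 11, "sku_num": 1, "amount": 15.0},
]


def amount_by(dimension, foreign_key, attribute):
    """Sum the additive `amount` measure grouped by a dimension attribute."""
    totals = defaultdict(float)
    for row in fact_order:
        totals[dimension[row[foreign_key]][attribute]] += row["amount"]
    return dict(totals)


by_category = amount_by(dim_sku, "sku_id", "category")
by_gender = amount_by(dim_user, "user_id", "gender")
```

The same fact rows answer both questions because the measure is additive along every dimension.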

3. Fact table classification
(1) Transactional fact table
Each transaction or event, such as a sales order record or a payment record, is one row of the fact table. Once the transaction is committed and the row is inserted, the data is no longer changed; the table is updated incrementally.

(2) Periodic snapshot fact table
A periodic snapshot fact table does not retain all data, only the data at fixed intervals, such as daily or monthly sales, or monthly account balances.

(3) Accumulating snapshot fact table
An accumulating snapshot fact table tracks changes in business facts. For example, the data warehouse may need to accumulate or store the timestamps of each business stage, from order placement through packaging, shipping, and sign-off, to track progress through the order lifecycle. As the business process advances, the fact table record is updated accordingly.
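
The accumulating snapshot behavior can be sketched as one row per order whose milestone columns are filled in over time. The milestone column names below are hypothetical, chosen to mirror the stages named above.

```python
MILESTONES = ("order_time", "package_time", "ship_time", "sign_time")


def apply_event(fact_table, order_id, milestone, when):
    """Record a milestone on the order's single row, creating the row if needed."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row = fact_table.setdefault(
        order_id, {"order_id": order_id, **dict.fromkeys(MILESTONES)}
    )
    row[milestone] = when
    return row


facts = {}
apply_event(facts, 1001, "order_time", "2020-03-10")
apply_event(facts, 1001, "package_time", "2020-03-10")
apply_event(facts, 1001, "ship_time", "2020-03-11")
```

Unlike a transactional fact table, no new row is appended per event: the existing row is updated, and milestones not yet reached stay empty.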

4.6 Data Warehouse Modeling

1. ODS layer
(1) Keep the original data unmodified, so that the layer also serves as a data backup.
(2) Compress the data to reduce disk usage (for example, 100 GB of original data can be compressed to about 10 GB).
(3) Create partitioned tables to avoid full table scans later.

2. DWD layer
The DWD layer requires building a dimensional model. A star model is generally used, and the overall result usually takes the form of a constellation model.
Dimensional modeling generally follows four steps:
select business process -> declare granularity -> identify dimensions -> identify facts.

(1) Select business process
From the business system, select the business lines of interest, such as the order, payment, refund, and logistics businesses. One business line corresponds to one fact table.

(2) Declare granularity
Data granularity refers to the level of detail or aggregation of the data stored in the data warehouse.
Declaring granularity means precisely defining what one row of the fact table represents. Choose the finest granularity possible so that the table can serve a wide range of needs.
Typical granularity declarations:
Each product item in an order is one row of the order fact table; the granularity is each order placed.
The number of orders per week is one row; the granularity is weekly orders.
The number of orders per month is one row; the granularity is monthly orders.

(3) Identify dimensions
Dimensions describe business facts, mainly expressing information such as "who, where, and when".

(4) Identify facts
The term "facts" here refers to the measures of the business, such as order amount and number of orders.
At the DWD layer, modeling is driven by business processes: based on the characteristics of each specific business process, the finest-grained detail-level fact table is built. Fact tables can be widened where appropriate.
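
One reason to prefer the finest granularity in step (2) is that coarser grains can always be derived from it, but not the other way round. A minimal sketch with made-up order item rows and hypothetical column names:

```python
from collections import Counter
from datetime import date

# Finest grain: one row per product item in an order (sample data).
order_items = [
    {"order_id": 1, "sku_id": 10, "order_date": date(2020, 3, 9)},
    {"order_id": 1, "sku_id": 11, "order_date": date(2020, 3, 9)},
    {"order_id": 2, "sku_id": 10, "order_date": date(2020, 3, 12)},
    {"order_id": 3, "sku_id": 12, "order_date": date(2020, 3, 17)},
]


def orders_per_week(items):
    """Roll item-level rows up to one count per (ISO year, ISO week)."""
    # Collapse items to distinct orders first, so an order with several
    # items is counted once.
    order_dates = {item["order_id"]: item["order_date"] for item in items}
    weekly = Counter()
    for order_date in order_dates.values():
        iso = order_date.isocalendar()
        weekly[(iso[0], iso[1])] += 1
    return dict(weekly)


weekly_counts = orders_per_week(order_items)
```

A table stored only at the weekly grain could not answer item-level questions such as which SKUs were in order 1.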

3. DWS layer
Aggregates the daily behavior of each subject object to serve the subject wide tables of the DWT layer, and keeps some business detail data to meet special requirements.
Tables: daily device behavior, daily member behavior, daily product behavior, daily coupon statistics (reserved), daily activity statistics (reserved), daily purchase behavior (reserved).

4. DWT layer
Modeling is driven by the subject objects to be analyzed: based on the metric requirements of upper-layer applications and products, a full wide table is built for each subject object.
Tables: device topic, member topic, product topic, coupon topic (reserved), activity topic (reserved), purchase topic (reserved).
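
The relationship between the two layers can be sketched as folding each day's DWS rows into a cumulative DWT row per subject. The device fields below are hypothetical, and the sketch assumes each device appears at most once in a given day's DWS data.

```python
def fold_day(dwt, dws_rows, day):
    """Merge one day's DWS device rows into the cumulative DWT device table."""
    for row in dws_rows:
        topic = dwt.setdefault(
            row["device_id"],
            {"first_active": day, "last_active": day, "active_days": 0, "total_count": 0},
        )
        topic["last_active"] = day
        topic["active_days"] += 1
        topic["total_count"] += row["active_count"]
    return dwt


dwt_device = {}
fold_day(
    dwt_device,
    [{"device_id": "d1", "active_count": 3}, {"device_id": "d2", "active_count": 1}],
    "2020-03-10",
)
fold_day(dwt_device, [{"device_id": "d1", "active_count": 2}], "2020-03-11")
```

The DWT row keeps lifetime values per device, while the DWS rows stay at daily grain.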

5. ADS layer
Analyzes the indicators of each major topic of the e-commerce system.

5. ODS layer architecture

https://blog.csdn.net/a1786742005/article/details/105868600

6. DWD layer architecture

https://blog.csdn.net/a1786742005/article/details/105869203

7. DWS layer architecture

https://blog.csdn.net/a1786742005/article/details/105896551

8. DWT layer architecture

https://blog.csdn.net/a1786742005/article/details/105903030

9. ADS layer architecture

https://blog.csdn.net/a1786742005/article/details/105914300

10. Project summary

1. Thorough requirements analysis before building a data warehouse is very important.
2. The whole data warehouse contains 25 tables in the ODS layer, 26 in the DWD layer, 6 in the DWS layer, 5 in the DWT layer, and 19 in the ADS layer, 81 tables in total.

Origin blog.csdn.net/qq_36816848/article/details/113845673