E-commerce data warehouse project requirements and architecture design

1. Project requirements

1. Build a user behavior data collection platform

2. Build a business data collection platform

3. Dimensional modeling of the data warehouse

4. Statistical indicators

5. Ad-hoc query tool, so that indicators can be analyzed at any time

6. Monitor cluster performance and send alerts when anomalies occur (via third-party channels)

7. Metadata management

8. Quality control

9. Permission management (table level and field level)

2. Technology selection

Selection criteria: data volume, business needs, industry experience, technology maturity, development and maintenance costs, and total budget

Data collection and transmission: Flume, Kafka, Sqoop, Logstash (log collection), DataX

Data storage: MySQL (ADS layer), HDFS, HBase, Redis, MongoDB

Data computing: Hive, Tez, Spark, Flink, Storm

Data query: Presto, Kylin, Impala, Druid, ClickHouse, Doris

Data visualization: ECharts, Superset (open source and free); QuickBI (offline) and DataV (real-time), both Alibaba products

Task scheduling: Azkaban, Oozie, DolphinScheduler, Airflow

Cluster monitoring: Zabbix (offline), Prometheus (real-time)

Metadata Management: Atlas

Permission management: Ranger, Sentry (Sentry has since been retired by Apache)

3. System data flow processing

Nginx: load balancing, mainly responsible for distributing incoming data evenly across the collection servers

The data is mainly divided into business data and user behavior data.

Business data is stored in MySQL and is synchronized to the cluster through Sqoop.
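As a rough illustration of this synchronization step, the sketch below assembles a typical `sqoop import` command and runs it from Python; the connection details, table name, and HDFS path are placeholder assumptions, not values from the original project.

```python
import subprocess

# Hypothetical connection details and paths -- replace with real values.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host:3306/mall",
    "--username", "root",
    "--password-file", "/user/hive/.mysql_pwd",   # avoid plain-text passwords
    "--table", "order_info",                      # assumed business table
    "--target-dir", "/origin_data/mall/db/order_info/2023-08-01",
    "--delete-target-dir",                        # makes re-runs idempotent
    "--num-mappers", "1",
    "--fields-terminated-by", "\t",
]
subprocess.run(sqoop_import, check=True)
```

The reverse direction described further below (ADS layer back to MySQL) works the same way with `sqoop export` and `--export-dir`.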

User behavior data mainly comes from front-end tracking points (buried points) and is written to log files. Flume collects the log files into Kafka first rather than writing them straight to the cluster, so that Kafka can buffer the traffic and smooth out peaks when the data volume surges; a second Flume agent then synchronizes the data from Kafka to the cluster. Hive on Spark performs storage, cleaning, transformation, and other operations, and the data is organized into layers: ODS (original/raw data layer), DWD (data detail layer), DWS (data service layer), DWT (data subject layer), and ADS (data application layer).
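The ODS-to-DWD cleaning step might look like the following. This is a minimal sketch submitted from PySpark for readability; in the pipeline described here it would be a HiveQL script executed by Hive on Spark, and all table, partition, and JSON field names are assumptions.

```python
from pyspark.sql import SparkSession

# Minimal ODS -> DWD cleaning sketch; table and field names are made up.
spark = (SparkSession.builder
         .appName("ods_to_dwd")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    INSERT OVERWRITE TABLE dwd.start_log PARTITION (dt = '2023-08-01')
    SELECT
        get_json_object(line, '$.common.mid') AS device_id,
        get_json_object(line, '$.common.uid') AS user_id,
        get_json_object(line, '$.ts')         AS ts
    FROM ods.start_log
    WHERE dt = '2023-08-01'
      AND get_json_object(line, '$.ts') IS NOT NULL  -- drop malformed records
""")
```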

The ADS-layer data is then exported back to MySQL through Sqoop for visual analysis and display (Superset).

During the computation process, data in the DWD, DWS, and DWT layers can be queried ad hoc through Presto.
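An ad-hoc Presto query from Python could look like this sketch, using the `pyhive` client; the coordinator address, port, and table name are assumptions.

```python
from pyhive import presto  # pip install 'pyhive[presto]'

# Hypothetical coordinator address and DWS table.
conn = presto.connect(host="presto-coordinator", port=8080,
                      catalog="hive", schema="dws")
cur = conn.cursor()
cur.execute("""
    SELECT dt, count(*) AS uv
    FROM user_action_daycount
    WHERE dt >= cast(date_add('day', -7, current_date) AS varchar)
    GROUP BY dt
    ORDER BY dt
""")
for row in cur.fetchall():
    print(row)
```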

Multi-dimensional analysis of the DWD-layer data can be performed through Kylin, with the pre-computed results stored in HBase.

Azkaban can be used as the timed task scheduling tool.

Metadata management using Atlas

Permission management using Ranger

Data quality management using Python+Shell
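A minimal sketch of what such a Python + Shell quality check might look like: it runs two counts through the `hive -e` command line and exits non-zero when a null-rate threshold is exceeded, which in turn fails the scheduled job. Table, column, and threshold are illustrative assumptions.

```python
import subprocess
import sys

TABLE, COLUMN, DT = "dwd.order_info", "order_id", "2023-08-01"  # assumed names

def hive_count(sql: str) -> int:
    """Run a query through the hive CLI and return the count it prints."""
    out = subprocess.run(["hive", "-e", sql], check=True,
                         capture_output=True, text=True).stdout
    return int(out.strip().splitlines()[-1])

total = hive_count(f"SELECT count(*) FROM {TABLE} WHERE dt = '{DT}'")
nulls = hive_count(
    f"SELECT count(*) FROM {TABLE} WHERE dt = '{DT}' AND {COLUMN} IS NULL")

null_rate = nulls / total if total else 1.0
if null_rate > 0.01:  # example threshold: fail above 1% nulls
    print(f"DQ check failed: {COLUMN} null rate {null_rate:.2%}", file=sys.stderr)
    sys.exit(1)       # non-zero exit marks the scheduled task as failed
```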

Cluster monitoring using Zabbix

4. Framework version selection and cluster sizing

Apache: open source and free

Cloud server: Alibaba Cloud EMR, Amazon EMR, Tencent Cloud EMR, Huawei Cloud EMR

The choice of physical machine or cloud server is mainly based on the needs of the company.

Physical machines: high costs for space, electricity, machine maintenance, and later server operation and maintenance, but relatively high security.

Cloud servers: also high cost, but later maintenance is easier; security is lower than with physical machines.

How much server capacity to buy?

1,000,000 daily active users × 100 log entries per user per day × 1 KB per entry ≈ 100 GB/day; × 180 days (no expansion for half a year) = 18 TB; × 3 HDFS replicas = 54 TB; reserving a 20%~30% buffer (÷ 0.7) ≈ 77 TB
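The same estimate as a small Python calculation (decimal units, 1 TB = 10^9 KB, matching the figures above):

```python
# Storage sizing for the assumptions above.
daily_users = 1_000_000   # daily active users
entries     = 100         # log entries per user per day
entry_kb    = 1           # ~1 KB per entry
days        = 180         # half a year without expansion
replicas    = 3           # HDFS replication factor
buffer      = 0.30        # keep 20%~30% headroom free

raw_tb   = daily_users * entries * entry_kb * days / 1e9  # KB -> TB
total_tb = raw_tb * replicas / (1 - buffer)
print(f"raw data: {raw_tb:.0f} TB, with replicas and buffer: {total_tb:.0f} TB")
# raw data: 18 TB, with replicas and buffer: 77 TB
```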

On top of that, factor in data warehouse layering (which stores additional copies of the data) and data compression (which reduces it).

Cluster resource planning and design

Production cluster principles:

Separate memory-intensive services onto different nodes.

Put services that transfer data to each other intensively on the same nodes.

Place clients on one server as much as possible, to make external access easier to manage (data security).

Services that depend on each other should be placed on the same server where possible.

Test cluster:

Source: blog.csdn.net/GX_0824/article/details/132566416