1. Project requirements
1. Construction of user behavior data collection platform
2. Construction of business data collection platform
3. Data warehouse dimensional modeling
4. Statistical indicators
5. Ad hoc query tool so metrics can be analyzed on demand
6. Cluster performance monitoring, with alerting on anomalies (via third-party notification channels)
7. Metadata Management
8. Data quality control
9. Rights management (table level, field level)
2. Technology selection
Selection criteria: data volume, business needs, industry experience, technology maturity, development and maintenance costs, and total budget
Data collection and transmission: Flume, Kafka, Sqoop, Logstash (log collection), DataX
Data storage: MySQL (ADS layer), HDFS, HBase, Redis, MongoDB
Data computing: Hive, Tez, Spark, Flink, Storm
Data query: Presto, Kylin, Impala, Druid, ClickHouse, Doris
Data visualization: ECharts, Superset (open source and free); QuickBI (offline) and DataV (real-time) (Alibaba products)
Task scheduling: Azkaban, Oozie, DolphinScheduler, Airflow
Cluster monitoring: Zabbix (offline), Prometheus (real-time)
Metadata Management: Atlas
Permission management: Ranger, Sentry (Sentry has since been retired to the Apache Attic)
3. System data flow processing
Nginx: load balancing, distributing incoming collection traffic evenly across the servers
The data is mainly divided into business data and user behavior data.
Business data is stored in MySQL, and data is synchronized to the cluster through Sqoop.
User behavior data mainly comes from front-end event tracking (buried points) and is written to log files. Flume collects the log files into Kafka first, rather than writing straight to the cluster, so that Kafka can buffer traffic spikes and smooth out peaks; a second Flume layer then delivers the data to the cluster. Hive on Spark performs storage, cleaning, and transformation, and the warehouse is layered into ODS (raw data), DWD (detail), DWS (service/summary), DWT (subject), and ADS (application) layers.
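As an illustrative sketch only (not the project's actual code), the ODS-to-DWD cleaning step amounts to parsing raw log lines, dropping malformed records, and keeping typed fields. The field names `uid`, `event`, and `ts` here are hypothetical examples:

```python
import json

def clean_behavior_log(raw_line: str):
    """ODS -> DWD style cleaning: parse one raw JSON log line,
    drop malformed records, and keep only the typed fields we need.
    Field names (uid, event, ts) are hypothetical examples."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # malformed line: filter it out
    if "uid" not in record or "ts" not in record:
        return None  # required fields missing: filter it out
    return {
        "uid": str(record["uid"]),
        "event": record.get("event", "unknown"),
        "ts": int(record["ts"]),
    }

raw_lines = [
    '{"uid": 1001, "event": "page_view", "ts": 1700000000}',
    'not-json-at-all',        # dropped: unparseable
    '{"event": "click"}',     # dropped: missing uid/ts
]
cleaned = [r for r in (clean_behavior_log(l) for l in raw_lines) if r]
```

In the real pipeline this logic would run as a Hive on Spark job over ODS partitions rather than over an in-memory list.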
The ADS layer data is then synchronized to MySQL through Sqoop for visual analysis and display (Superset)
During the calculation process, DWD, DWS, and DWT layer data can be queried ad hoc through Presto.
Multi-dimensional analysis of DWD layer data can be performed through Kylin, with the precomputed cubes stored in HBase
Scheduled task orchestration using Azkaban
Metadata management using Atlas
Permission management using Ranger
Data quality management using Python+Shell
Cluster monitoring using Zabbix
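For the "Python+Shell" data quality management mentioned above, a minimal sketch (assumed logic, not the project's actual checks) is a null-rate check per required field, flagging any field whose missing rate exceeds a threshold; a real pipeline would run this per Hive table or partition:

```python
def quality_report(rows, required_fields, max_null_rate=0.05):
    """Minimal data-quality check: for each required field, compute the
    null/missing rate over the rows and flag fields above the threshold.
    Threshold and field names are illustrative assumptions."""
    total = len(rows)
    report = {}
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        rate = nulls / total if total else 1.0  # empty input counts as failing
        report[field] = {"null_rate": rate, "ok": rate <= max_null_rate}
    return report

rows = [
    {"uid": 1, "event": "view"},
    {"uid": 2, "event": None},
    {"uid": None, "event": "click"},
]
report = quality_report(rows, ["uid", "event"], max_null_rate=0.4)
```

A shell wrapper would typically pull the rows via a Hive query, invoke this check, and page out through the alerting channel when any field reports `ok == False`.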
4. Framework release version selection and cluster size
Apache: open source and free
Cloud server: Alibaba Cloud EMR
Amazon EMR
Tencent Cloud EMR
Huawei Cloud MRS
The choice between physical machines and cloud servers mainly depends on the company's needs.
Physical machines: high costs for rack space, electricity, hardware maintenance, and later server operations, but relatively high data security.
Cloud servers: also costly, but later maintenance is easier; data security is considered lower than with physical machines.
How to size the servers?
1,000,000 daily active users × 100 log entries per user on average × 1 KB per entry ≈ 100 GB/day; × 180 days (no expansion for half a year) ≈ 18 TB; × 3 replicas ≈ 54 TB; reserving 20%~30% free headroom gives 54 / 0.7 ≈ 77 TB
Then factor in data warehouse layering (which adds extra copies of the data) and compression (which reduces it).
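The arithmetic above can be reproduced as a small sizing helper (a sketch, assuming decimal units where 1 TB = 10^9 KB, and treating the 20%~30% reserve as free headroom, i.e. dividing by 1 minus the reserve fraction):

```python
def estimate_storage_tb(daily_users, logs_per_user, log_kb, days, replicas, reserve):
    """Capacity estimate matching the worked figures above.
    reserve: fraction of disk kept free, so we divide by (1 - reserve)."""
    daily_gb = daily_users * logs_per_user * log_kb / 1e6   # KB -> GB
    raw_tb = daily_gb * days / 1e3                          # GB -> TB
    return raw_tb * replicas / (1 - reserve)

# 1M DAU * 100 entries * 1 KB = 100 GB/day; * 180 days = 18 TB;
# * 3 replicas = 54 TB; / 0.7 (30% reserve) ~= 77 TB
total = estimate_storage_tb(1_000_000, 100, 1, 180, 3, 0.30)
```

Layering and compression (the next considerations mentioned above) would be further multipliers on `raw_tb` before replication.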
Cluster resource planning and design
Production cluster principles:
Separate memory-intensive services onto different nodes
Co-locate services that transfer large amounts of data between each other
Place client (gateway) processes on a single server where possible, to simplify external access control (data security)
Co-locate services that depend on each other where possible
Test cluster: