Data warehouse construction in practice (1)

Question guide:
1. What is the architecture of the data warehouse?
2. How to select technology?
3. How to design the system data flow?
4. How to select the server?

1. Data Warehouse

Data Warehouse is a strategic collection of data that provides system-wide data support for all of an enterprise's decision-making processes.
By analyzing the data in the warehouse, an enterprise can improve business processes, control costs, improve product quality, and so on.
The data warehouse is not the final destination of the data; rather, it prepares the data for its final destination. This preparation includes data cleaning, transformation, classification, reorganization, merging, splitting, statistics, and so on.
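
The preparation steps above can be illustrated with a minimal, self-contained Python sketch. The log format and field names are invented for illustration only; in practice these steps would typically run in Hive or Spark on the cluster.

```python
import json
from collections import Counter

# Hypothetical raw behavior-log lines; field names are illustrative only.
raw_lines = [
    '{"uid": "u001", "event": "click", "page": "/item/42", "ts": 1599994800}',
    '{"uid": "u001", "event": "view",  "page": "/item/42", "ts": 1599994700}',
    'not-a-json-line',  # dirty record that the cleaning step should drop
    '{"uid": "u002", "event": "favorite", "page": "/item/7", "ts": 1599995000}',
]

def clean(lines):
    """Cleaning: drop records that cannot be parsed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def transform(records):
    """Transformation/classification: normalize fields and tag an event category."""
    for r in records:
        r["event"] = r["event"].lower()
        r["category"] = "interaction" if r["event"] in ("click", "favorite") else "exposure"
        yield r

# Statistics: a simple aggregation of events per category.
events = list(transform(clean(raw_lines)))
print(Counter(r["category"] for r in events))  # Counter({'interaction': 2, 'exposure': 1})
```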




2. Project requirements
1. User behavior data collection platform construction
2. Business data collection platform construction
3. Data warehouse dimensional modeling
4. Analyze core e-commerce themes such as users, traffic, members, commodities, sales, regions, and activities, and produce statistical report indicators.
5. Use ad hoc query tools to perform indicator analysis at any time.
6. Monitor cluster performance and raise alarms when anomalies occur
7. Metadata management
8. Quality monitoring


3. Technology selection
[1] Questions
1. How to select the project technology?
2. How to choose the framework version (Apache, CDH, HDP)?
3. Should the servers be physical machines or cloud hosts?
4. How to confirm the cluster size? (Assuming 8T hard disk for each server)
[2] Main considerations for technology selection

  • Data size
  • Business needs
  • Industry experience
  • Technology maturity
  • Development and maintenance costs
  • Total cost budget

[3] Technology used
1. Data collection and transmission: Flume, Kafka, Sqoop, Logstash, DataX
2. Data storage: MySQL, HDFS, HBase, Redis, MongoDB
3. Data calculation: Hive, Tez, Spark, Flink, Storm
4. Data query: Presto, Druid, Impala, Kylin
5. Data visualization: Echarts, Superset, QuickBI, DataV
6. Task scheduling: Azkaban, Oozie
7. Cluster monitoring: Zabbix
8. Metadata management: Atlas
9. Data quality monitoring: Griffin
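
To make the layering concrete, here is a minimal sketch of how a single behavior log record could pass through the collection, storage, computation, and query layers. The function names are placeholders standing in for the roles of the tools listed above, not their real APIs.

```python
# Minimal sketch of how the layers above chain together for one record.
# Function names are illustrative placeholders, not APIs of the listed tools.

def collect(raw_line: str) -> dict:
    """Collection/transmission layer (Flume/Kafka role): parse one raw log line."""
    uid, event, page = raw_line.split("\t")
    return {"uid": uid, "event": event, "page": page}

def store(record: dict, table: list) -> None:
    """Storage layer (HDFS/HBase role): append the record to a table."""
    table.append(record)

def compute(table: list) -> dict:
    """Computation layer (Hive/Spark role): aggregate page views per page."""
    stats = {}
    for r in table:
        if r["event"] == "view":
            stats[r["page"]] = stats.get(r["page"], 0) + 1
    return stats

def query(stats: dict, page: str) -> int:
    """Query layer (Presto/Kylin role): look up a precomputed indicator."""
    return stats.get(page, 0)

ods = []                                   # stand-in for the raw storage layer
store(collect("u001\tview\t/item/42"), ods)
store(collect("u002\tview\t/item/42"), ods)
print(query(compute(ods), "/item/42"))     # -> 2
```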
 


4. System data flow design
Data sources (a minimal example of each is sketched below):

  • User behavior data (embedded tracking): data generated while users use the product and interact with the client, such as page views, clicks, dwell time, comments, likes, and favorites.
  • Business interaction data: data related to logins, orders, users, products, payments, etc. generated during business processes, usually stored in databases such as MySQL and Oracle.

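The two source types might look like this in practice; the field names and values below are made up for illustration, not a real tracking schema or order table.

```python
import json
import time

# 1) User behavior data: an embedded/tracked event, typically written as a JSON
#    log line and shipped toward the warehouse via Flume/Kafka.
behavior_event = {
    "uid": "u10086",          # illustrative field names only
    "event": "page_view",
    "page": "/item/42",
    "stay_ms": 5300,
    "ts": int(time.time() * 1000),
}
print(json.dumps(behavior_event))

# 2) Business interaction data: a row in a relational table (e.g. an order in MySQL),
#    usually imported in batches with a tool such as Sqoop or DataX.
order_row = ("o20200913001", "u10086", "sku_42", 199.00, "paid")
print(order_row)
```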
Architecture diagram:




5. Framework version selection
1) How to choose Apache/CDH/HDP version?

  • (1) Apache: operation and maintenance are troublesome, and compatibility between components has to be verified by yourself. (Generally used by large companies with strong technical teams and professional O&M staff.) (recommended)
  • (2) CDH: the most widely used distribution in China, but Cloudera Manager (CM) is not open source, and starting this year it charges 10,000 US dollars per node.
  • (3) HDP: open source and suitable for secondary development, but not as stable as CDH; less commonly used in China.

2) Apache framework version

3) CDH framework version: 5.12.1



6. Server selection
1) Physical machine
Take a server with 128 GB of memory, a 20-core physical CPU (40 threads), an 8 TB HDD, and a 2 TB SSD as an example: a single Dell-brand machine is quoted at around 40,000 RMB (4W). A physical machine generally lasts about 5 years. Professional operation and maintenance staff are also needed, at an average of about 10,000 RMB per month, and the electricity bill is another considerable expense.
2) Cloud host
Take Alibaba Cloud as an example: a host with roughly the same configuration costs about 50,000 RMB (5W) per year. Much of the operation and maintenance work is handled by Alibaba Cloud, so maintenance is relatively easy. (A rough cost comparison is sketched below.)
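
Taking the figures above at face value, and inventing a placeholder electricity cost (the text only says it is a considerable expense), a back-of-the-envelope comparison might look like this:

```python
# Rough 5-year cost sketch using the figures above (all amounts in RMB).
# Placeholder assumptions: one O&M engineer covers the whole cluster, and the
# electricity estimate is invented for illustration.
YEARS = 5
CLUSTER_SIZE = 10                       # e.g. the ~10 servers estimated in section 7

# Physical machines
machine_price = 40_000                  # one Dell server, ~4W
ops_salary_yr = 10_000 * 12             # one O&M engineer, ~1W per month
electricity_yr_per_server = 5_000       # placeholder assumption
physical_total = (machine_price * CLUSTER_SIZE
                  + ops_salary_yr * YEARS
                  + electricity_yr_per_server * CLUSTER_SIZE * YEARS)

# Cloud hosts of similar configuration
cloud_per_server_yr = 50_000            # ~5W per year, O&M mostly done by Alibaba Cloud
cloud_total = cloud_per_server_yr * CLUSTER_SIZE * YEARS

print(f"physical, {CLUSTER_SIZE} servers, {YEARS} years: {physical_total:,} RMB")
print(f"cloud,    {CLUSTER_SIZE} servers, {YEARS} years: {cloud_total:,} RMB")
```

Under these placeholder numbers, physical machines come out cheaper over five years, which matches the advice below that companies with long-term plans and sufficient funds buy hardware; changing the O&M and electricity assumptions can easily tip the balance.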
3) Enterprise choice
1. Companies with ample funds and no direct conflict with Alibaba choose Alibaba Cloud.
2. Small and medium-sized companies aiming for financing and listing choose Alibaba Cloud, then buy physical machines after securing financing.
3. Companies with long-term plans and sufficient funds choose physical machines.

7. Cluster scale
1) How to determine the cluster size? (Assumption: each server has 8 TB of disk and 128 GB of memory)

  • 1. 1 million daily active users, each generating an average of 100 log entries per day: 1,000,000 × 100 = 100 million entries per day
  • 2. Each log entry is about 1 KB, and there are 100 million per day: 100,000,000 / 1024 / 1024 ≈ 100 GB per day
  • 3. Assuming no server expansion within half a year: 100 GB × 180 days ≈ 18 TB
  • 4. HDFS keeps 3 replicas: 18 TB × 3 = 54 TB
  • 5. Reserve a 20%~30% buffer: 54 TB / 0.7 ≈ 77 TB
  • 6. Conclusion: about 10 servers with 8 TB of disk each (see the sketch below)
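
As a sanity check, the same arithmetic can be written out as a short Python sketch, using the assumptions stated above and the same rounding as the bullet points:

```python
# Cluster-size arithmetic from the bullet points above.
# Assumptions: 1 KB per log entry, 3 HDFS replicas, ~30% headroom, 8 TB per server.
daily_active_users = 1_000_000
logs_per_user      = 100
daily_logs = daily_active_users * logs_per_user     # 100 million log entries per day

daily_gb     = daily_logs / 1024 / 1024             # ~95 GB/day; the text rounds to ~100 GB
half_year_tb = 100 * 180 / 1000                     # ~18 TB of raw data in 180 days
replicated   = half_year_tb * 3                     # 3 HDFS replicas -> 54 TB
with_buffer  = replicated / 0.7                     # reserve 20%~30% headroom -> ~77 TB
servers      = with_buffer / 8                      # 8 TB of disk per server

print(f"~{daily_gb:.0f} GB/day, ~{with_buffer:.0f} TB total, about {servers:.0f} servers")
```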

2) Data warehouse layering and data compression also need to be considered; once they are factored in, the numbers above need to be recalculated.
3) Test server planning

Origin blog.csdn.net/ytp552200ytp/article/details/108579473