12: Project Summary: Background Requirements
- Goal : Master the project background and project requirements of one-stop manufacturing
- path
- step1: industry background
- step2: Project requirements
- implement
- Project Industry : Industrial Internet Big Data: Internet of Things
- Project name : Gas station service provider data operation management platform
- Refer to other projects: commercial big data analysis platform: Sensors
- Company products : Fuel dispenser equipment service
- Corporate customers : Sinopec, PetroChina, CNOOC, Shell, Total...
- overall demand
- Requirement 1: Improve the service quality of the company's products through data analysis
- Statistical analysis based on the data of equipment installation, maintenance, inspection and transformation of gas stations
- Data analysis of the call center supporting the equipment maintenance requirements of the gas station site and after-sales service
- Requirement 2: Support the company's cost operation and accounting through data analysis
- Guarantee the warehousing and logistics of parts and the needs of the supply chain
- Realize all cost operation accounting in the service process
- Requirement 3: Prepare data for future automated fuel dispenser equipment
- Get all user and vehicle information to realize automatic refueling management
- specific requirement
- Operation analysis: number of call center service orders, number of equipment work orders, number of participating service engineers, parts consumption and supply indicators, etc.
- Equipment analysis: equipment oil volume monitoring, equipment operation status monitoring, number of installations, inspection times, maintenance times, and transformation times
- Call center: the number of calls, the total number of work orders, the total number of dispatch orders, the total number of completions, and the number of verification orders
- Employee analysis: the number of personnel, the number of orders received, the number of evaluations, and the number of business trips
- Cost analysis: warehouse material management analysis, user analysis
- summary
- Master the project background and project requirements of one-stop manufacturing
13: Project Summary: Data Sources
-
Goal : Master the business process and data sources of one-stop manufacturing projects
-
path
- step1: business process
- step2: data source
-
implement
-
Business Process
- Step 1: The gas station service provider contacts the call center and requests a service: installation, inspection, maintenance, or modification of a fuel dispenser
- The call center records the request in the incoming call acceptance transaction fact table
- Step 2: The call center contacts the corresponding service site and assigns a work order: it contacts the site supervisor, who assigns service personnel
- Work order information is recorded in the service order information table and the work order information table
- Step 3: The service personnel confirm the work order and the gas station information
- Type-specific work order tables: installation order, maintenance order
- Step 4: The service personnel arrive at the gas station on the specified date to service the equipment
- Step 5: If the service is an installation or inspection and it succeeds, the service is complete
- Step 6: If the service is a maintenance or modification, materials must be requested from the service site; once the materials arrive and the work is done, the service is complete
- Step 7: After the service is completed, the end of the service is confirmed with the gas station service provider and the order is verified
- Step 8: The engineer claims reimbursement for expenses incurred during the service
- All reimbursed expenses are recorded in the travel expense information table and the expense detail table
- Step 9: The call center makes regular return visits to follow up on the engineer's service for each work order
- Recorded in the return visit information table
-
Data Sources
-
ERP system : Enterprise resource planning system, which stores information on all resources of the entire company
- All engineers, items, equipment product supply chain, production, sales, and financial information are in the ERP system
-
CISS system : customer service management system, storing all user and operation data
- Work order information, user information
-
Call center system : responsible for the realization of all customer demand applications, scheduling, return visits, etc.
- Call information, assignment information, return visit information
-
core data table
- Operational Analysis
- Work order analysis, installation analysis, maintenance analysis, inspection analysis, transformation analysis, call acceptance analysis
- Improve service quality
- Return visit analysis
- Operating Costing
- Revenue, Support Analysis
-
summary
- Master the business process and data sources of one-stop manufacturing projects
14: Project Summary: Theme Division
-
Goal : Master the subject division of one-stop manufacturing projects
-
implement
- service domain
- Installation Topics: Installation Method, Payment Fee, Installation Type
- Work order subject: dispatch method, total number of work orders, dispatch type, total number of completed work orders
- Repair Topics: Payment Charges, Parts Charges, Failure Types
- Dispatch topics: number of dispatch orders, average dispatch order, dispatch order response time
- Expense Topics: Travel Expenses, Installation Expenses, Reimbursement Personnel Statistics
- Topics of return visits: Number of return visits, return work order status
- Gas station theme: total number of gas stations, newly added gas stations
- client domain
- Customer theme: number of installations, number of repairs, number of inspections, number of return visits
- storage domain
- Subject of write-off of good products under warranty: quantity of write-off, amount of spare parts
- Subject of write-off of defective products under warranty: number of parts to be written off, amount of parts to be written off after verification
- Repair theme: repair application, repair material quantity, repair type
- Transfer topic: transfer status, transfer quantity, transfer device type
- Write-off of consumables: total write-off, write-off equipment type
- provider domain
- Work order topic: dispatch method, total number of work orders, work order type, customer type
- Service Provider Gas Station Theme: Number of Gas Stations, New Number of Gas Stations
- Operation domain
- Operational topics: service staff hours, repair station analysis, average work order, network distribution
- market domain
- Market topics: work order statistics, completion details, order statistics
-
summary
- Master the subject division of one-stop manufacturing projects
15: Project Summary: Technical Architecture
-
Goal : Master the technical architecture of one-stop manufacturing projects
-
implement
-
Data Generation : Business Database System
- Oracle: work order data, material data, service provider data, reimbursement data, etc.
-
data collection
- Sqoop: offline database acquisition
- How does Sqoop collect Oracle data
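The notes ask how Sqoop collects Oracle data without showing a command. A minimal sketch, assuming a JDBC thin connection and a full-table import into HDFS as Avro files: the host, SID, schema, table name, user, and paths below are hypothetical placeholders, while the flags are standard Sqoop 1.4 import options.

```python
import shlex


def build_sqoop_import(table, jdbc_url, username, password_file, target_dir, mappers=1):
    """Return the argument list for a full-table Sqoop import written as Avro files."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,
        "--delete-target-dir",              # overwrite the output of a previous run
        "--target-dir", target_dir,
        "--as-avrodatafile",                # Avro instead of plain text
        "-m", str(mappers),                 # number of parallel map tasks
    ]


# Every value below is a hypothetical placeholder.
cmd = build_sqoop_import(
    table="SCHEMA.WORK_ORDER",
    jdbc_url="jdbc:oracle:thin:@oracle.example.com:1521:ORCL",
    username="etl_user",
    password_file="/user/etl/.oracle.pwd",
    target_dir="/data/ods/work_order/20211015",
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` from a scheduling script, which avoids shell quoting problems when table names or paths contain special characters.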
-
data storage
- Hive [HDFS]: offline data warehouse [table]
-
data calculation
- SparkSQL: HiveSQL-like development method: processing and analyzing structured data in the data warehouse
- Python | Java: Spark SQL DSL development, submitted and run with spark-submit
- Spark SQL + ThriftServer: SQL development, submitting SQL statements directly
-
data application
- MySQL: result storage
- FineBI/Tableau: Visualization Tools
-
monitoring tool
- Prometheus: server performance indicator monitoring tool
- Grafana: monitoring visualization tool
-
scheduling tool
- Airflow: Task Flow Scheduling Tool
-
Technology Architecture
-
-
summary
- Master the technical architecture of one-stop manufacturing projects
16: Project Summary: Data Warehouse Design
-
Goal : Master the hierarchical design and modeling design of one-stop manufacturing projects
-
path
- step1: layered design
- step2: modeling design
-
implement
-
layered design
- ODS : Raw data layer: the layer closest to the original data, direct collection and writing layer: original transaction fact table
- Data content: Store all original business data, which is basically consistent with the business data in the Oracle database
- Data source: Synchronous collection from Oracle using Sqoop
- Storage design: Hive partition table, stored in avro file format, retained for 3 months
- DWD : detailed data layer: the result after ETL is realized for the data of ODS layer according to business requirements: transaction fact table after ETL
- Data content: store detailed data of all business data
- Data source: Obtained by ETL flattening of ODS layer data
- Storage design: Hive partition table, orc file format storage, retain all data
- DWB : basic data layer: similar to the DWM explained before, lightly aggregated
- Association: associate the tables of subject facts, and merge all fields related to this subject into one table
- Aggregation: Build underlying metrics based on transactional facts about topics
- Subject fact table
- Data content: Store data such as basic associations between all facts and dimensions, basic fact indicators, etc.
- Data source: data after cleaning, filtering and light aggregation of DWD layer data
- Storage design: Hive partition table, orc file format storage, retain all data
- ST : data application layer: similar to the APP explained before, storing the results of each topic based on dimensional analysis aggregation: periodic snapshot fact table
- Reports for data analysis
- Data content: store the factual data of all report analysis
- Data source: Based on the DWB and DWS layers, the indicators of all report facts are obtained through statistical aggregation of different dimensions
- DM : Data Mart: data organized by the needs of different departments; for now no subject data is actually stored in this layer
- Archives data by department to ease iterative development of new business requirements in the future
- Data content: store data of different subjects required by different departments
- Data source: Aggregation and statistics of DW layer data are divided according to different departments
- DWS : dimension data layer: similar to the previously explained DIM: storing dimension data tables
- Data content: store dimension data of all businesses: date, region, gas station, call center, warehouse and other dimension tables
- Data source: Dimensional data extracted from DWD detailed data
- Storage design: Hive ordinary table, orc file + Snappy compression
- Features: small quantity, few changes, full collection
- Data Warehouse Design
- Top-down (used in the online education project): first clarify the requirements and subjects, then collect and process data according to those subjects
- Scenario: few data applications and relatively simple requirements
- Bottom-up (used in one-stop manufacturing): first load all of the company's data into the data warehouse, then draw on it directly as future requirements arise
- Scenario: many data applications and complex business
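The storage designs listed for the layers above (partitioned Hive tables, Avro for ODS, ORC for DWD and DWB) can be sketched as generated DDL. This is only an illustration: the database names `ods` and `dwd`, the table `work_order`, its columns, and the partition column `dt` are hypothetical, not taken from the project.

```python
def hive_ddl(database, table, columns, file_format, partition_col="dt"):
    """Build a CREATE TABLE statement for a date-partitioned Hive table."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        f"STORED AS {file_format}"
    )


# Hypothetical columns for a work order table.
columns = [("id", "STRING"), ("station_id", "STRING"), ("create_time", "STRING")]

ods_ddl = hive_ddl("ods", "work_order", columns, "AVRO")  # raw layer: Avro
dwd_ddl = hive_ddl("dwd", "work_order", columns, "ORC")   # detail layer: ORC

print(ods_ddl)
print(dwd_ddl)
```

Generating the statements from one column list keeps the ODS and DWD definitions in step, which matters when several hundred source tables have to be created in batch.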
-
modeling design
-
Modeling Method: Dimensional Modeling
-
Dimensional Design: Star Schema
-
Common Dimensions
- datetime dimension
- Year Dimension, Quarter Dimension, Month Dimension, Week Dimension, Day Dimension
- Period-over-period ratios (day, week, month) and year-over-year ratios (day, week, month)
- Period-over-period (ring ratio): comparison with the immediately preceding period, e.g. this month vs. last month
- Year-over-year: comparison with the same period of the previous year, e.g. this month vs. the same month last year
- Administrative area dimension
- Regional level: country dimension, province dimension, city dimension, county dimension, township dimension
- Dimensions of service outlets
- Branch name, branch number, province, city, county, affiliated institution
- Gas station dimension
- Gas station type, gas station name, gas station number, customer number, customer name, province, city, county, gas station status, affiliated company
- Organization Dimension
- Personnel No., Personnel Name, Post No., Post Name, Department No., Department Name
- service type dimension
- Type number, Type name
- Device Dimensions
- Equipment type, equipment number, equipment name, number of oil guns, pump type, software type
- Failure Type Dimension
- First-level failure number, first-level failure name, second-level failure number, and second-level failure name
- Logistics company dimension
- Logistics company number, logistics company name
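To make the star schema and the date dimension above concrete, here is a small sketch using SQLite from the Python standard library in place of Hive. The table names `dim_date` and `fact_work_order` and all sample rows are hypothetical. The fact table joins to the date dimension, the result is aggregated by month, and the period-over-period and year-over-year ratios are then derived from the monthly totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_id TEXT PRIMARY KEY,   -- e.g. '2021-09-15'
    year    INTEGER,
    month   INTEGER
);
CREATE TABLE fact_work_order (
    order_id INTEGER PRIMARY KEY,
    date_id  TEXT REFERENCES dim_date(date_id)
);
""")

# Hypothetical sample data: one day per month is enough for the illustration.
dates = [("2020-09-15", 2020, 9), ("2021-08-15", 2021, 8), ("2021-09-15", 2021, 9)]
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", dates)

orders_per_day = {"2020-09-15": 4, "2021-08-15": 5, "2021-09-15": 6}
for date_id, n in orders_per_day.items():
    conn.executemany(
        "INSERT INTO fact_work_order (date_id) VALUES (?)", [(date_id,)] * n
    )

# Fact table joined to the date dimension, aggregated at month level.
monthly = {
    (year, month): total
    for year, month, total in conn.execute(
        """
        SELECT d.year, d.month, COUNT(*)
        FROM fact_work_order AS f
        JOIN dim_date AS d ON f.date_id = d.date_id
        GROUP BY d.year, d.month
        """
    )
}


def growth(current, base):
    """Relative change of `current` against `base`."""
    return (current - base) / base


mom = growth(monthly[(2021, 9)], monthly[(2021, 8)])  # vs. the preceding month
yoy = growth(monthly[(2021, 9)], monthly[(2020, 9)])  # vs. the same month last year
print(f"month-over-month: {mom:.0%}, year-over-year: {yoy:.0%}")
```

The same query shape carries over to Spark SQL on the Hive tables; only the connection and the table names change.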
-
Subject Dimension Matrix
-
-
- summary
- Master the hierarchical design and modeling design of one-stop manufacturing projects
17: Project Summary: Optimization and New Features
-
Goal : Master the optimization scheme in the one-stop manufacturing project
-
implement
-
Optimization : Refer to the optimization document in "Employment Interview" in FTP
-
Resource optimization: set configuration properties to allocate more resources, and allocate memory reasonably
-
Development optimization: predicate pushdown: filter out unnecessary data as early as possible, before the join
- Prefer operators with map-side aggregation: aggregate within each partition first, then aggregate across partitions
- Filter out data that does not need to take part in the join, or use a Broadcast Join
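Why map-side aggregation helps can be shown with a toy model in plain Python. This is not Spark code: each inner list stands for one partition, and the "shuffled" count is the number of records that would cross the network. The sample keys are made up.

```python
from collections import Counter

# Three partitions of hypothetical work order types.
partitions = [
    ["install", "repair", "install", "install"],
    ["repair", "inspect", "install"],
    ["inspect", "inspect", "repair", "install"],
]


def without_combine(parts):
    """Ship every record to the reduce side, then count there."""
    shuffled = [key for part in parts for key in part]
    return Counter(shuffled), len(shuffled)


def with_combine(parts):
    """Count inside each partition first, then merge the partial counts."""
    partials = [Counter(part) for part in parts]
    shuffled = sum(len(partial) for partial in partials)  # one record per key per partition
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total, shuffled


plain_total, plain_shuffled = without_combine(partitions)
combined_total, combined_shuffled = with_combine(partitions)
print(plain_total == combined_total, plain_shuffled, combined_shuffled)
```

Both paths give the same totals, but the pre-aggregated path moves one record per key per partition instead of every input record. In Spark this is the difference between operators such as `reduceByKey`, which combine before the shuffle, and `groupByKey`, which does not.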
-
Structure optimization: file storage type, partition structure
-
Partition table: static partition pruning
select count(*) from table1 where daystr = '2021-10-15'; -- the filter is served by partition pruning
-- In Spark 2, the join runs first and the filter is applied afterwards:
select * from table1 join table2 on table1.id = table2.id and table1.daystr = '2021-10-15' and table2.daystr='2021-10-15';
-
-
-
New features: Spark3.0
-
Dynamic Partition Pruning
-
The default partition pruning is only valid for single-table query filtering
-
Enable dynamic partition pruning: automatically query and filter the data of both tables according to conditions during Join, and then join the filtered results
spark.sql.optimizer.dynamicPartitionPruning.enabled=true
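The idea behind dynamic partition pruning can be illustrated with a toy simulation in plain Python. It is a sketch of the concept only, not Spark's implementation: the fact table is modelled as a dict from partition value to rows, and the partition values and labels are hypothetical. The filter on the small dimension table is evaluated first, and its result decides which fact partitions are read at all.

```python
# Fact table partitioned by day: partition value -> rows (station_id, amount).
fact_partitions = {
    "2021-10-13": [(1, 120.0), (2, 80.0)],
    "2021-10-14": [(1, 95.0)],
    "2021-10-15": [(2, 60.0), (3, 40.0)],
}
# Small dimension table: (day, label).
dim_rows = [("2021-10-13", "normal"), ("2021-10-14", "normal"), ("2021-10-15", "promo")]


def join_without_pruning(facts, dims, dim_pred):
    """Read every fact partition, then keep the rows that match the dimension filter."""
    wanted = {day for day, label in dims if dim_pred(label)}
    out, scanned = [], 0
    for day, rows in facts.items():
        scanned += 1
        if day in wanted:
            out.extend((day, *row) for row in rows)
    return out, scanned


def join_with_pruning(facts, dims, dim_pred):
    """Evaluate the dimension filter first and read only the matching fact partitions."""
    wanted = {day for day, label in dims if dim_pred(label)}
    out, scanned = [], 0
    for day in wanted:
        if day in facts:
            scanned += 1
            out.extend((day, *row) for row in facts[day])
    return out, scanned


def is_promo(label):
    return label == "promo"


rows_a, scanned_a = join_without_pruning(fact_partitions, dim_rows, is_promo)
rows_b, scanned_b = join_with_pruning(fact_partitions, dim_rows, is_promo)
print(sorted(rows_a) == sorted(rows_b), scanned_a, scanned_b)
```

The join result is identical in both cases; only the number of partitions read differs, which is where the I/O saving comes from.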
-
-
Adaptive Query Execution
-
Re-optimizes the query plan at runtime from statistics collected during execution, in the spirit of a cost-based optimizer (CBO): process the data at minimum cost
-
Automatically set the number of Reducers [ShuffleRead] according to statistical information to avoid waste of memory and I/O resources
-
Automatically select a better join strategy to improve connection query performance
-
Automatically handle skewed joins: partitions that are much larger than the rest are detected and split into smaller ones, which avoids data skew
spark.sql.adaptive.enabled=true
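The first of these behaviours, choosing the number of reducers from runtime statistics, can be sketched as a greedy merge of small adjacent shuffle partitions. This is a simplified model, not Spark's actual algorithm; the sizes and the 64 MB target are made-up values. In Spark 3 the feature is governed by `spark.sql.adaptive.coalescePartitions.enabled` and `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
def coalesce_partitions(sizes, target):
    """Greedily merge adjacent partitions while the merged size stays within `target`."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical shuffle partition sizes in MB, as measured after the map stage.
sizes_mb = [10, 20, 30, 5, 5, 64, 8]
groups = coalesce_partitions(sizes_mb, target=64)
print(groups)
```

Seven shuffle partitions become four reduce tasks here, so fewer tasks are launched for tiny inputs while the large partition is left alone.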
-
-
-
-
summary
- Master the optimization scheme in the one-stop manufacturing project
18: Project Summary: Questions
-
Goal : Master the problems and solutions encountered in the one-stop manufacturing project
-
implement
-
Problem 1: Inconsistency in data collection
- Phenomenon: The number of records in the Hive table is inconsistent with the number of records in Oracle
- Reason: Some Oracle fields contain special characters such as line breaks. When Sqoop writes the data as plain text, those characters are treated as record separators, so one record is split across several lines
- solve
- Solution 1: Replace or delete the special characters [only when this does not affect the business data]
- Solution 2: Use the Avro file format instead of plain text
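The record count mismatch is easy to reproduce in plain Python. The sketch below uses made-up rows and a simplified text export, not Sqoop itself: a field containing a line break turns three records into four lines, and removing the delimiter characters (Solution 1) restores the count. Sqoop offers `--hive-drop-import-delims` for this kind of cleanup; Avro avoids the problem because it does not rely on line breaks to separate records.

```python
def write_as_text(rows, field_sep="\t"):
    """Serialize rows the way a plain text export does: one line per record."""
    return "\n".join(field_sep.join(row) for row in rows)


def count_lines(text):
    """Count records the way a line-based reader would."""
    return len(text.split("\n"))


def strip_delimiters(rows, bad=("\n", "\r", "\t")):
    """Remove characters that collide with the record and field separators."""
    cleaned = []
    for row in rows:
        new_row = []
        for value in row:
            for ch in bad:
                value = value.replace(ch, "")
            new_row.append(value)
        cleaned.append(tuple(new_row))
    return cleaned


# Hypothetical source rows; the second one contains an embedded line break.
rows = [
    ("1", "pump replaced"),
    ("2", "nozzle leak\nfixed on site"),
    ("3", "inspection ok"),
]

broken = count_lines(write_as_text(rows))
fixed = count_lines(write_as_text(strip_delimiters(rows)))
print(len(rows), broken, fixed)
```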
-
Problem 2: Data Skew Problem
- Repartitioning: redistribute data into more partitions
- Custom partitioning: replace the default hash partitioning [reduceByKey] or range partitioning [sortBy] with a custom partitioner
- Filter first and then join, or use broadcast join
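One common way to implement custom partitioning for skewed keys is key salting, which the notes do not name explicitly. The toy model below is plain Python with hypothetical keys and a made-up partition count: plain hash partitioning sends every record of a hot key to one partition, while salting the known hot key spreads it over all partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4


def partition_for(key, salt=0):
    """Hash partitioning; a non-zero salt shifts the record to another partition."""
    return (zlib.crc32(key.encode("utf-8")) + salt) % NUM_PARTITIONS


def partition_loads(records, hot_keys=frozenset()):
    """Count records per partition, salting only the keys known to be hot."""
    loads = Counter()
    for seq, key in enumerate(records):
        salt = seq % NUM_PARTITIONS if key in hot_keys else 0
        loads[partition_for(key, salt)] += 1
    return loads


# One gas station dominates the data set (hypothetical numbers).
records = ["station_A"] * 900 + ["station_B"] * 50 + ["station_C"] * 50

plain = partition_loads(records)
salted = partition_loads(records, hot_keys={"station_A"})
print(max(plain.values()), max(salted.values()))
```

Because the hot key now lives in several partitions, an aggregation needs a second step that merges the partial results per original key. That extra step is the price of the more even load.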
-
Problem 3: Small file problem
- Each Task will generate a result file
- The number of tasks is determined by the number of partitions
- More partitions, less data per partition
- Adjust the number of partitions: repartition
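How many partitions to ask for can be estimated from the output size. A rough sketch, assuming a target file size of 128 MB (a common HDFS block size; adjust it to the cluster) and made-up figures for the job:

```python
import math


def target_partitions(total_bytes, target_file_bytes=128 * 1024 ** 2):
    """Number of partitions that yields files of roughly the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical job: 2 GB of results written from 200 shuffle partitions.
output_bytes = 2 * 1024 ** 3
shuffle_partitions = 200

avg_file_mb = output_bytes / shuffle_partitions / 1024 ** 2
wanted = target_partitions(output_bytes)
print(f"{shuffle_partitions} files of about {avg_file_mb:.1f} MB each; "
      f"repartition to {wanted} for files of about 128 MB")
```

In PySpark the result would be passed to `df.repartition(n)` or `df.coalesce(n)` before the write; `coalesce` avoids a full shuffle when only reducing the partition count.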
-
Problem 4: Insufficient ThriftServer resources, GC problem
start-thriftserver.sh \
  --name sparksql-thrift-server \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1g \
  --hiveconf hive.server2.thrift.http.port=10001 \
  --num-executors 3 \
  --executor-memory 1g \
  --conf spark.sql.shuffle.partitions=2
- Essence: Spark program runs on YARN
- Process: Driver + Executor
- Problem: If the resources of this program are given less, it will cause GC [memory garbage collection] pause and memory overflow
- The Driver process is faulty, the program runs slowly, and the memory overflows
- solve
- Give the Driver more resources: the Driver runs persistently, continuously parses, schedules and assigns tasks, and handles all client interaction
- --driver-cores 4
- --driver-memory 12g
- Give the application more Executors
-
Problem 5: ThriftServer single point of failure
- Similar to the single point of failure problem of HiveServer2
- Solution: HA high availability structure, build two ThriftServer
- Solution 1: Start two ThriftServers on two machines respectively
- Question: beeline can only connect to one of them, so which one? If one is picked at random, what happens when it fails?
- Solution: HAproxy tool, operation and maintenance configuration
- Solution 2: Use ZK to implement auxiliary elections, one Active, one Standby
- HiveServer2 supports this natively through configuration
- Spark ThriftServer requires modifying the source code
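The routing decision that HAProxy takes over in Solution 1 (check which server is alive, send the client there) can be sketched on the client side in a few lines. This only illustrates the failover idea and does not replace a proxy; the host names are hypothetical, and port 10001 is the one used in the start command above.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_endpoint(endpoints, probe=tcp_probe):
    """Return the first endpoint that answers the health probe."""
    for host, port in endpoints:
        if probe(host, port):
            return host, port
    raise ConnectionError("no ThriftServer endpoint is reachable")


# Hypothetical hosts running one ThriftServer each.
endpoints = [("node1", 10001), ("node2", 10001)]


def fake_probe(host, port):
    """Stand-in for tcp_probe that pretends node1 is down."""
    return host == "node2"


chosen = pick_endpoint(endpoints, probe=fake_probe)
print(chosen)
```

With the default `tcp_probe`, the same function would test the real servers before a beeline or JDBC connection string is built.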
-
-
summary
- Master the problems and solutions encountered in one-stop manufacturing projects
19: Project Summary: Data Scale
- Goal : Master the scale of data in one-stop manufacturing projects
- implement
- What is the daily data increment?
- The number of total data tables in the project: more than 300 tables
- Transaction fact table of core business: 100 tables
- Fact increment for each core transaction: 170,000 items/day
- The average size of each piece of data: 1KB
- Total data incremental range per day: 16GB
- How many machines are there in the cluster?
- Storage capacity per machine: 20TB
- Available ratio of each machine: 80%
- Available capacity per machine: 16TB
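- Overall data storage for five years: 16 GB/day * 3 replicas * 365 days * 5 years = 87,600 GB, about 85.5 TB; at 16 TB available per machine this needs 6 DataNode/NodeManager machines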
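The sizing arithmetic above can be checked with a short script. The figures are the ones stated in the notes; reading the factor 3 as the HDFS replication factor and using binary units (1 GB = 1024 * 1024 KB) are assumptions.

```python
import math

tables = 100                      # core transaction fact tables
rows_per_table_per_day = 170_000  # daily increment per table
row_kb = 1                        # average record size in KB

daily_gb = tables * rows_per_table_per_day * row_kb / 1024 ** 2


def nodes_needed(daily_gb, replicas=3, years=5, disk_tb=20, usable_ratio=0.8):
    """Machines required to hold `years` of replicated daily increments."""
    total_tb = daily_gb * replicas * 365 * years / 1024
    return math.ceil(total_tb / (disk_tb * usable_ratio))


print(f"daily increment: {daily_gb:.1f} GB")
print(f"machines for five years: {nodes_needed(16)}")
```

The estimate only covers raw storage growth; headroom for intermediate data and the derived warehouse layers would come on top of it.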
- Project team size?
- Take 12 people as an example: project manager: 1, product manager: 1, offline: 5 people, web system: 2 people, test: 2 people, operation and maintenance: 1 person
- summary
- Master the scale of data in one-stop manufacturing projects
20: Project Summary: Resume Template
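Project name: One-Stop Manufacturing big data project (January 2021 - September 2021)
Project architecture:
spark2.4+hive2.1+hadoop2.7+sqoop1.4+oracle11g+mysql5.7+airflow2.0
Project overview:
The One-Stop Manufacturing project is a big data project in the industrial internet sector, built to solve problems that a traditional data storage architecture cannot handle. The petroleum manufacturing industry produces large volumes of operations and warehouse material data; a big data architecture is used to handle data storage, analysis, and visualization for this complex business. Business indicator data is stored in a layered Hive data warehouse, and Spark SQL is used for data analysis. The core business covers operators, the call center, work orders, gas stations, and warehouse materials.
Responsibilities:
1. Imported business system data stored in relational databases into HDFS.
2. Batch-created Hive tables from the source tables, setting partitions and storage formats.
3. Built the data warehouse model based on business relationships and analysis indicators.
4. Implemented the data modeling and table creation for each warehouse layer in the model.
5. Implemented data extraction, transformation, and loading for each layer.
6. Wrote shell scripts that batch-import data with Sqoop.
7. Orchestrated the task scheduling of the Sqoop data imports.
8. Analyzed data application layer indicators with Spark SQL.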
- Application of industrial big data: https://zhuanlan.zhihu.com/p/166300187
- Companies in the oil and energy industry: https://top.chinaz.com/hangye/index_qiye_shihua.html
- Business data analysis platforms: Umeng, TalkingData, Sensors Data