Table of contents
1. The concept of data warehouse
2. Project requirements and architecture design
3. Cluster resource planning and design
4. Vehicle log field description
1. The concept of data warehouse
A data warehouse is a system that provides data support for an enterprise: it helps the enterprise make decisions, improve business processes, and improve product quality. It can ingest various types of input data, such as business data, log data, and crawler data. In this project, however, we only compute statistics on and analyze log data.
Specifically, we focus on one type of log data: sensor data collected while a car is running, which records the usage of each sensor and the related readings during driving. This data is important for improving vehicle performance, diagnosing problems, and analyzing driving behavior.
2. Project requirements and architecture design
- Project requirements:
- Technology selection:
- Core architecture:
The vehicle dimension information is synchronized in full at fixed times with DataX: the data is first uploaded to HDFS and then mapped to tables in Hive. The vehicle driving logs are uploaded to the data warehouse with Flume and stored in the ODS layer, and the common subqueries are computed in the DWS layer. Finally, the ADS layer is exported to MySQL for machine learning.
- Framework version selection
Apache framework version used in this project:
- Server selection
- Cluster size
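To make the data flow in the core architecture easier to follow, here is a minimal sketch that writes it down as an ordered list of stages. The `Stage` class and the endpoint labels are illustrative assumptions. The text does not say which tool computes the DWS and ADS layers or which tool exports to MySQL, so those entries are marked as unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    source: str  # where the data comes from
    tool: str    # component that moves or transforms the data
    target: str  # where the data lands


# Order follows the core architecture description. Labels are illustrative;
# "unspecified" marks tools that the text does not name.
PIPELINE = [
    Stage("vehicle dimension data (database)", "DataX", "HDFS"),
    Stage("HDFS", "Hive table mapping", "Hive tables"),
    Stage("vehicle driving logs", "Flume", "ODS layer"),
    Stage("ODS layer", "unspecified", "DWS layer"),
    Stage("DWS layer", "unspecified", "ADS layer"),
    Stage("ADS layer", "unspecified", "MySQL"),
]


def describe(pipeline):
    """Render each stage as a 'source --[tool]--> target' line."""
    return [f"{s.source} --[{s.tool}]--> {s.target}" for s in pipeline]


if __name__ == "__main__":
    for line in describe(PIPELINE):
        print(line)
```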
3. Cluster resource planning and design
In an enterprise, a production cluster and a test cluster are usually both built. The production cluster runs production tasks, and the test cluster is used for writing and testing code before it goes live.
- Production cluster
The plan follows the officially recommended deployment for Tencent Cloud EMR:
- Master node: the management node, which keeps cluster scheduling running normally. It mainly hosts processes such as NameNode, ResourceManager, and HMaster. There is 1 Master node in non-HA mode and 2 in HA mode.
- Core node: the compute and storage node. All data in HDFS is stored on Core nodes, so to keep the data safe, Core nodes cannot be scaled in after they have been scaled out. They mainly host processes such as DataNode, NodeManager, and RegionServer. At least 2 are required in non-HA mode and at least 3 in HA mode.
- Common node: provides data sharing and synchronization as well as high-availability fault tolerance for the Master nodes of an HA cluster. It mainly hosts distributed coordination components such as ZooKeeper and JournalNode. There are 0 Common nodes in non-HA mode and at least 3 in HA mode.
Deploy memory-intensive components on separate servers.
Place components that exchange data with each other close together (Kafka, ClickHouse).
Place clients on one or two servers where possible to simplify external access.
Place components that depend on each other on the same server where possible (for example, Ds-worker and Hive/Spark).
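The node counts for the three roles can be turned into a small sanity check. This is only a sketch: the function names and the `plan` dictionary format are assumptions made for illustration, and the numbers are the ones given in the role descriptions.

```python
def required_nodes(ha: bool) -> dict:
    """Node counts per role as stated in the role descriptions."""
    if ha:
        return {"master": 2, "core": 3, "common": 3}
    return {"master": 1, "core": 2, "common": 0}


def check_plan(plan: dict, ha: bool) -> list:
    """Return a list of problems; an empty list means the plan fits."""
    req = required_nodes(ha)
    problems = []
    master = plan.get("master", 0)
    core = plan.get("core", 0)
    common = plan.get("common", 0)
    # Master is an exact count: 1 in non-HA mode, 2 in HA mode.
    if master != req["master"]:
        problems.append(f"master: expected {req['master']}, got {master}")
    # Core is a lower bound in both modes.
    if core < req["core"]:
        problems.append(f"core: expected at least {req['core']}, got {core}")
    # Common is a lower bound in HA mode and must be 0 otherwise.
    if ha and common < req["common"]:
        problems.append(f"common: expected at least {req['common']}, got {common}")
    if not ha and common != 0:
        problems.append(f"common: expected 0 in non-HA mode, got {common}")
    return problems


if __name__ == "__main__":
    # An HA plan with 2 Master, 3 Core, and 3 Common nodes.
    print(check_plan({"master": 2, "core": 3, "common": 3}, ha=True))
```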
| Master | Master | core | core | core | common | common | common |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nn | nn | dn | dn | dn | JournalNode | JournalNode | JournalNode |
| rm | rm | nm | nm | nm | | | |
| | | | | | zk | zk | zk |
| hive | hive | hive | hive | hive | | | |
| | | | | | kafka | kafka | kafka |
| spark | spark | spark | spark | spark | | | |
| datax | datax | datax | datax | datax | | | |
| Ds-master | Ds-master | Ds-worker | Ds-worker | Ds-worker | | | |
| mysql | mysql | | | | | | |
| | | | | | flume | flume | flume |
Test cluster server planning
| Service name | Subservice | Server hadoop102 | Server hadoop103 | Server hadoop104 |
| --- | --- | --- | --- | --- |
| HDFS | NameNode | √ | | |
| | DataNode | √ | √ | √ |
| | SecondaryNameNode | | | √ |
| Yarn | NodeManager | √ | √ | √ |
| | ResourceManager | | √ | |
| Zookeeper | Zookeeper Server | √ | √ | √ |
| Flume (collecting logs) | Flume | √ | √ | |
| Kafka | Kafka | √ | √ | √ |
| Flume (consuming Kafka logs) | Flume | | | √ |
| Hive | | √ | √ | √ |
| MySQL | MySQL | √ | | |
| DataX | | √ | √ | √ |
| Spark | | √ | √ | √ |
| DolphinScheduler | ApiApplicationServer | √ | | |
| | AlertServer | √ | | |
| | MasterServer | √ | | |
| | WorkerServer | √ | √ | √ |
| | LoggerServer | √ | √ | √ |
| Total number of services | | 15 | 11 | 11 |
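The totals row of the test cluster plan (15, 11, 11) can be cross-checked by counting services per server. The sketch below does that. Note that the placement of the services that run on only one or two servers is an assumption here, chosen so that the counts reproduce the totals; the `PLACEMENT` dictionary and the function name are likewise illustrative.

```python
from collections import Counter

HOSTS = ("hadoop102", "hadoop103", "hadoop104")

# Assumed placement: services that run on fewer than three servers are
# assigned so that the per-server counts match the totals row (15, 11, 11).
PLACEMENT = {
    "NameNode": ("hadoop102",),
    "DataNode": HOSTS,
    "SecondaryNameNode": ("hadoop104",),
    "NodeManager": HOSTS,
    "ResourceManager": ("hadoop103",),
    "Zookeeper Server": HOSTS,
    "Flume (collecting logs)": ("hadoop102", "hadoop103"),
    "Kafka": HOSTS,
    "Flume (consuming Kafka logs)": ("hadoop104",),
    "Hive": HOSTS,
    "MySQL": ("hadoop102",),
    "DataX": HOSTS,
    "Spark": HOSTS,
    "ApiApplicationServer": ("hadoop102",),
    "AlertServer": ("hadoop102",),
    "MasterServer": ("hadoop102",),
    "WorkerServer": HOSTS,
    "LoggerServer": HOSTS,
}


def services_per_host(placement: dict) -> dict:
    """Count how many services are assigned to each server."""
    counts = Counter()
    for hosts in placement.values():
        counts.update(hosts)
    return {host: counts[host] for host in HOSTS}


if __name__ == "__main__":
    print(services_per_host(PLACEMENT))
```

A check like this is handy when the plan changes: moving a service to another server only requires editing one entry, and the totals follow automatically.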
4. Vehicle log field description
All the data processed in this project is vehicle log data, that is, a record of the vehicle's own state that is sent every 30 seconds while the vehicle is driving. In addition to log data, we also need to process vehicle dimension data, which is stored in a database.
Vehicle log data is critical for analyzing and predicting vehicle performance, maintenance needs, and faults. Vehicle dimension data, in turn, provides additional information about the vehicle, such as production date, make, and model, which helps us better understand the vehicle's performance and characteristics. We process both types of data.
Vehicle log data
The vehicle log data is a text file in JSON format. Each line is a complete JSON string, and the fields have the following meanings:
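| Field name | Field description |
| --- | --- |
| vin | Vehicle unique code |
| timestamp | Log collection time |
| car_status | Vehicle status |
| charg_status | Charging status |
| execution_mode | Operating mode |
| velocity | Vehicle speed |
| mileage | Mileage |
| voltage | Total voltage |
| electric_current | Total current |
| soc | State of charge (SOC) |
| dc_status | DC-DC status |
| gear | Gear position |
| insulation_resistance | Insulation resistance |
| motor_count | Number of drive motors |
| motor_list | List of drive motors |
| fuel_cell_voltage | Fuel cell voltage |
| fuel_cell_current | Fuel cell current |
| fuel_cell_consume_rate | Fuel consumption rate |
| fuel_cell_temperature_probe_count | Total number of fuel cell temperature probes |
| fuel_cell_temperature | Fuel cell temperature values |
| fuel_cell_max_temperature | Highest temperature in the hydrogen system |
| fuel_cell_max_temperature_probe_id | Probe number of the highest temperature in the hydrogen system |
| fuel_cell_max_hydrogen_consistency | Highest hydrogen concentration |
| fuel_cell_max_hydrogen_consistency_probe_id | Sensor code of the highest hydrogen concentration |
| fuel_cell_max_hydrogen_pressure | Highest hydrogen pressure |
| fuel_cell_max_hydrogen_pressure_probe_id | Sensor code of the highest hydrogen pressure |
| fuel_cell_dc_status | High-voltage DC-DC status |
| engine_status | Engine status |
| crankshaft_speed | Crankshaft speed |
| fuel_consume_rate | Fuel consumption rate |
| max_voltage_battery_pack_id | Subsystem number of the highest-voltage battery |
| max_voltage_battery_id | Cell code of the highest-voltage battery |
| max_voltage | Highest battery cell voltage |
| min_voltage_battery_pack_id | Subsystem number of the lowest-voltage battery |
| min_voltage_battery_id | Cell code of the lowest-voltage battery |
| min_voltage | Lowest battery cell voltage |
| max_temperature_subsystem_id | Subsystem number of the highest temperature |
| max_temperature_probe_id | Probe number of the highest temperature |
| max_temperature | Highest temperature value |
| min_temperature_subsystem_id | Subsystem number of the lowest temperature |
| min_temperature_probe_id | Probe number of the lowest temperature |
| min_temperature | Lowest temperature value |
| alarm_level | Highest alarm level |
| alarm_sign | General alarm flags |
| custom_battery_alarm_count | Total number of rechargeable energy storage device faults (N1) |
| custom_battery_alarm_list | List of rechargeable energy storage device fault codes |
| custom_motor_alarm_count | Total number of drive motor faults (N2) |
| custom_motor_alarm_list | List of drive motor fault codes |
| custom_engine_alarm_count | Total number of engine faults (N3) |
| custom_engine_alarm_list | List of engine fault codes |
| other_alarm_count | Total number of other faults (N4) |
| other_alarm_list | List of other fault codes |
| battery_count | Total number of battery cells |
| battery_pack_count | Total number of battery packs |
| battery_voltages | List of battery cell voltage values |
| battery_temperature_probe_count | Total number of battery cell temperature probes |
| battery_pack_temperature_count | Total number of battery packs |
| battery_temperatures | List of battery cell temperature values |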
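The motor list (motor_list) is a nested field. Its fields have the following meanings:
| Field name | Field description |
| --- | --- |
| id | Drive motor sequence number |
| status | Drive motor status |
| controller_temperature | Drive motor controller temperature |
| rev | Drive motor rotational speed |
| torque | Drive motor torque |
| temperature | Drive motor temperature |
| voltage | Motor controller input voltage |
| electric_current | Motor controller DC bus current |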
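Since each log line is one complete JSON string with a nested motor_list, reading a record comes down to parsing the line and walking the nested list. The following is a minimal sketch using Python's standard `json` module. The sample line is hypothetical: the field names come from the field tables, but every value (including the unit of `timestamp`) is invented for illustration, and the helper names are assumptions.

```python
import json

# Hypothetical sample line: field names follow the field tables,
# all values are invented for illustration.
SAMPLE_LINE = (
    '{"vin": "VIN-EXAMPLE-0001", "timestamp": 1700000000000, '
    '"velocity": 50, "soc": 80, "motor_count": 1, '
    '"motor_list": [{"id": 1, "status": 1, "rev": 3000, '
    '"torque": 120, "temperature": 60}]}'
)


def parse_log_line(line: str) -> dict:
    """Parse one log line; each line is a complete JSON string."""
    record = json.loads(line)
    # motor_list is a nested field; fall back to an empty list if absent.
    record.setdefault("motor_list", [])
    return record


def motor_temperatures(record: dict) -> dict:
    """Map each drive motor id to its temperature reading."""
    return {m["id"]: m.get("temperature") for m in record["motor_list"]}


if __name__ == "__main__":
    rec = parse_log_line(SAMPLE_LINE)
    print(rec["vin"], motor_temperatures(rec))
```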
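Vehicle dimension data
| Field name | Field description |
| --- | --- |
| id | Vehicle unique code |
| type_id | Vehicle model ID |
| type | Vehicle model |
| sale_type | Sales model |
| trademark | Brand |
| company | Manufacturer |
| seating_capacity | Rated seating capacity |
| power_type | Vehicle power type |
| charge_type | Charging types supported by the vehicle |
| category | Vehicle category |
| weight_kg | Gross mass (kg) |
| warranty | Vehicle warranty period (years / 10,000 km) |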
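Both the log field `vin` and the dimension field `id` are described as the vehicle's unique code, so a log record can presumably be enriched with dimension attributes by looking up `vin` in the dimension data. The sketch below assumes the two fields hold the same value, which the tables suggest but do not state outright. The helper names and all sample values are hypothetical.

```python
def index_by_id(dimension_rows: list) -> dict:
    """Build a lookup table from vehicle id to its dimension row."""
    return {row["id"]: row for row in dimension_rows}


def enrich(log_record: dict, vehicles_by_id: dict) -> dict:
    """Attach selected dimension attributes to a log record.

    Assumes log_record["vin"] matches the dimension table's "id".
    Attributes are set to None when the vehicle is unknown.
    """
    vehicle = vehicles_by_id.get(log_record.get("vin"), {})
    enriched = dict(log_record)  # copy so the input is not modified
    for key in ("type", "trademark", "power_type"):
        enriched[key] = vehicle.get(key)
    return enriched


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    vehicles = index_by_id([
        {"id": "VIN-EXAMPLE-0001", "type": "sedan",
         "trademark": "ExampleBrand", "power_type": "electric"},
    ])
    print(enrich({"vin": "VIN-EXAMPLE-0001", "velocity": 50}, vehicles))
```

In the warehouse itself this lookup would be a join between the log table and the dimension table; the dictionary version only illustrates the key relationship.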
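This project is based on the Atguigu (尚硅谷) course:
[Atguigu Big Data Project: New Energy Vehicle Data Warehouse, Offline Data Warehouse Hands-on Project] https://www.bilibili.com/video/BV1uF411o74x/?p=7&share_source=copy_web&vd_source=2d7beee727c4b0510439779fd78c22f7
Appendix: a new-energy Tesla generated with Stable Diffusion.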