Monitor: System Monitoring
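Basic Concepts
Overall Classification
This module contains system-level software, such as code that checks hardware status and monitors system health.
In Apollo 5.5, the Monitor module performs the following checks:
- Status of running modules
- Monitoring data integrity
- Monitoring data frequency
- System health (for example CPU, memory, and disk usage)
- Generating end-to-end latency statistics reports
The first three checks are configurable.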
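Classification by Attribute
By attribute, Apollo's monitors fall into two groups: hardware status monitoring and software status monitoring.
Hardware status monitoring covers:
- GPS
- Resource
- ESD-CAN
- Socket-CAN
Software status monitoring covers:
- Channel status
- Functional safety status
- Latency status
- Localization status
- Module status
- Process status
- Recorder status
The Summary module then packages these statuses and publishes them.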
Code Structure
├── BUILD
├── README.md
├── common
│   ├── BUILD
│   ├── monitor_manager.cc
│   ├── monitor_manager.h
│   ├── recurrent_runner.cc
│   ├── recurrent_runner.h
│   └── recurrent_runner_test.cc
├── hardware
│   ├── BUILD
│   ├── esdcan_monitor.cc
│   ├── esdcan_monitor.h
│   ├── gps_monitor.cc
│   ├── gps_monitor.h
│   ├── resource_monitor.cc
│   ├── resource_monitor.h
│   ├── socket_can_monitor.cc
│   └── socket_can_monitor.h
├── monitor.cc
├── monitor.h
├── proto
│   ├── BUILD
│   └── system_status.proto
└── software
    ├── BUILD
    ├── camera_monitor.cc
    ├── camera_monitor.h
    ├── channel_monitor.cc
    ├── channel_monitor.h
    ├── functional_safety_monitor.cc
    ├── functional_safety_monitor.h
    ├── latency_monitor.cc
    ├── latency_monitor.h
    ├── localization_monitor.cc
    ├── localization_monitor.h
    ├── module_monitor.cc
    ├── module_monitor.h
    ├── process_monitor.cc
    ├── process_monitor.h
    ├── recorder_monitor.cc
    ├── recorder_monitor.h
    ├── summary_monitor.cc
    └── summary_monitor.h
There are four main parts:
- Component entry point: monitor.cc/.h
- common: shared base classes
- hardware: hardware monitors
- software: software monitors
Overall Logic
Overall Flow
At runtime, Monitor first runs each sub-monitor, and then SummaryMonitor reports on the overall status, producing one of five statuses:
- Fatal
- Error
- Warn
- OK
- Unknown
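One way to picture the summary step is as "keep the most severe status seen". The following is a minimal sketch under that assumption, not the actual SummaryMonitor code: the `Status` enum, its numeric ordering, `ComponentReport`, and `Summarize` are all hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical severity levels, ordered so that a larger value is more severe.
// The ordering is an assumption made for this sketch.
enum class Status { kUnknown = 0, kOk = 1, kWarn = 2, kError = 3, kFatal = 4 };

// Hypothetical per-component report; not the real SystemStatus message.
struct ComponentReport {
  std::string name;
  Status status;
};

// Returns the most severe status among all reports.
// An empty input yields kUnknown because nothing has been observed yet.
Status Summarize(const std::vector<ComponentReport>& reports) {
  Status overall = Status::kUnknown;
  for (const auto& report : reports) {
    overall = std::max(overall, report.status);
  }
  return overall;
}
```

With this ordering, a single Fatal report dominates any number of OK reports, which matches the idea of a summary that surfaces the worst problem first.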
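FunctionalSafetyMonitor then takes two actions based on the status:
- Notify the driver to take action
- If the expected safety measure does not take effect, trigger the Guardian module (emergency stop)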
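The two-stage behavior above can be sketched as a small state machine: first warn, then escalate if nothing changes within a grace period. This is only an illustration under assumptions. `SafetyState`, `EvaluateSafety`, and the `grace_seconds` parameter are hypothetical and do not reflect the real FunctionalSafetyMonitor interface.

```cpp
#include <cassert>

// Hypothetical state tracked between evaluations.
struct SafetyState {
  bool in_safety_mode = false;          // True once the driver has been notified.
  double safety_mode_since = 0.0;       // Time (seconds) of the first notification.
  bool require_emergency_stop = false;  // True once escalation has been requested.
};

// One evaluation step.
// system_safe:   verdict derived from the summary status (assumed input).
// now:           current time in seconds.
// grace_seconds: how long the driver has to react before escalation (assumed).
void EvaluateSafety(bool system_safe, double now, double grace_seconds,
                    SafetyState* state) {
  assert(state != nullptr);
  if (system_safe) {
    // The problem cleared, so leave safety mode entirely.
    *state = SafetyState();
    return;
  }
  if (state->require_emergency_stop) {
    return;  // Already escalated; nothing more to do.
  }
  if (!state->in_safety_mode) {
    // First action: enter safety mode, which stands for notifying the driver.
    state->in_safety_mode = true;
    state->safety_mode_since = now;
    return;
  }
  if (now - state->safety_mode_since > grace_seconds) {
    // Second action: the notification did not help in time, so escalate.
    state->require_emergency_stop = true;
  }
}
```

Keeping the notification time in the state is what lets the second action depend on how long the first one has gone unanswered.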
Code Analysis
Monitor class structure: monitor.h/.cc
class Monitor : public apollo::cyber::TimerComponent {
 public:
  bool Init() override;
  bool Proc() override;

 private:
  std::vector<std::shared_ptr<RecurrentRunner>> runners_;
};
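Monitor is a timer component that inherits from TimerComponent: Init() handles initialization, and Proc() does the actual work each time the timer fires.
Monitor initialization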
MonitorManager::Instance()->Init(node_);
// Only the one CAN card corresponding to current mode will take effect.
runners_.emplace_back(new EsdCanMonitor());
runners_.emplace_back(new SocketCanMonitor());
// To enable the GpsMonitor, you must add FLAGS_gps_component_name to the
// mode's monitored_components.
runners_.emplace_back(new GpsMonitor());
// To enable the LocalizationMonitor, you must add
// FLAGS_localization_component_name to the mode's monitored_components.
runners_.emplace_back(new LocalizationMonitor());
// To enable the CameraMonitor, you must add
// FLAGS_camera_component_name to the mode's monitored_components.
runners_.emplace_back(new CameraMonitor());
// Monitor if processes are running.
runners_.emplace_back(new ProcessMonitor());
// Monitor if modules are running.
runners_.emplace_back(new ModuleMonitor());
// Monitor message processing latencies across modules
const std::shared_ptr<LatencyMonitor> latency_monitor(new LatencyMonitor());
runners_.emplace_back(latency_monitor);
// Monitor if channel messages are updated in time.
runners_.emplace_back(new ChannelMonitor(latency_monitor));
// Monitor if resources are sufficient.
runners_.emplace_back(new ResourceMonitor());
// Monitor all changes made by each sub-monitor, and summarize to a final
// overall status.
runners_.emplace_back(new SummaryMonitor());
// Check functional safety according to the summary.
if (FLAGS_enable_functional_safety) {
  runners_.emplace_back(new FunctionalSafetyMonitor());
}
return true;
runners_ is a member container of the class: std::vector<std::shared_ptr<RecurrentRunner>> runners_.
Init() flow:
- Initialize MonitorManager with the current node (node_ is available because the class inherits from the component base class)
- Add the following monitors to the container:
  - EsdCanMonitor
  - SocketCanMonitor
  - GpsMonitor
  - LocalizationMonitor
  - CameraMonitor
  - ProcessMonitor
  - ModuleMonitor
  - LatencyMonitor
  - ChannelMonitor
  - ResourceMonitor
  - SummaryMonitor
- Check whether functional safety is enabled
  - If so, also add FunctionalSafetyMonitor to the container
- Return true
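To make the "container of runners" idea concrete, here is a minimal, self-contained sketch of an interval-gated runner and a registration function. It is an assumption about how such a base class could work, not a copy of Apollo's RecurrentRunner: `SketchRunner`, `NoopMonitor`, `BuildRunners`, and the interval values are all hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a recurrent runner (hypothetical class).
class SketchRunner {
 public:
  SketchRunner(std::string name, double interval_sec)
      : name_(std::move(name)), interval_sec_(interval_sec) {
    assert(interval_sec_ > 0.0);
  }
  virtual ~SketchRunner() = default;

  // Calls RunOnce() only when this runner's own interval has elapsed.
  void Tick(double now) {
    if (now >= next_round_) {
      next_round_ = now + interval_sec_;
      ++round_count_;
      RunOnce(now);
    }
  }

  const std::string& name() const { return name_; }
  int round_count() const { return round_count_; }

 protected:
  virtual void RunOnce(double now) = 0;

 private:
  std::string name_;
  double interval_sec_;
  double next_round_ = 0.0;
  int round_count_ = 0;
};

// Hypothetical sub-monitor whose check does nothing.
class NoopMonitor : public SketchRunner {
 public:
  using SketchRunner::SketchRunner;

 protected:
  void RunOnce(double /*now*/) override {}
};

// Registers runners in order; the last one is optional, mirroring the
// flag-guarded registration in the Init() flow. Intervals are invented.
std::vector<std::shared_ptr<SketchRunner>> BuildRunners(
    bool enable_functional_safety) {
  std::vector<std::shared_ptr<SketchRunner>> runners;
  runners.emplace_back(std::make_shared<NoopMonitor>("ResourceMonitor", 5.0));
  runners.emplace_back(std::make_shared<NoopMonitor>("SummaryMonitor", 1.0));
  if (enable_functional_safety) {
    runners.emplace_back(
        std::make_shared<NoopMonitor>("FunctionalSafetyMonitor", 1.0));
  }
  return runners;
}
```

Registration order matters in a design like this: a summarizing runner placed after the individual monitors sees their results from the same cycle.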
Monitor execution function
bool Monitor::Proc() {
  const double current_time = apollo::cyber::Clock::NowInSeconds();
  if (!MonitorManager::Instance()->StartFrame(current_time)) {
    return false;
  }
  for (auto& runner : runners_) {
    runner->Tick(current_time);
  }
  MonitorManager::Instance()->EndFrame();
  return true;
}
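Flow
- Record the current time
- MonitorManager starts a frame (one round of monitoring), passing in the current time
  - If this fails, return false
- Iterate over the runners_ container and call Tick() on every monitor (starting each sub-monitor's task), passing in the current time
- MonitorManager closes the frame
- Return true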
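The start-frame, tick-all, end-frame pattern can be tried outside of Cyber RT with a small stand-in. The sketch below is an illustration only: `FrameManager`, `TickFn`, and `RunOneCycle` are hypothetical, and the rule that StartFrame fails while a frame is already open is an invented failure condition, since the real one is not shown in the excerpt.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for the manager's frame bookkeeping.
class FrameManager {
 public:
  // Opens a frame; refuses if one is already open (assumed failure rule).
  bool StartFrame(double now) {
    if (in_frame_) {
      return false;
    }
    in_frame_ = true;
    frame_time_ = now;
    return true;
  }
  void EndFrame() { in_frame_ = false; }

  bool in_frame() const { return in_frame_; }
  double frame_time() const { return frame_time_; }

 private:
  bool in_frame_ = false;
  double frame_time_ = 0.0;
};

// A runner is reduced to a callable that receives the current time.
using TickFn = std::function<void(double)>;

// Mirrors the Proc() flow: open a frame, tick every runner with the same
// timestamp, close the frame. Returns false if the frame cannot be opened.
bool RunOneCycle(FrameManager* manager, const std::vector<TickFn>& runners,
                 double now) {
  assert(manager != nullptr);
  if (!manager->StartFrame(now)) {
    return false;
  }
  for (const auto& tick : runners) {
    tick(now);
  }
  manager->EndFrame();
  return true;
}
```

Passing one timestamp to every runner means all sub-monitors in a cycle reason about the same instant, which keeps their interval checks consistent with each other.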