Inspur hardware monitoring (ipmitool, MegaCli)


Why monitor with ipmitool and MegaCli?

On Inspur servers, the information ipmitool reports corresponds one to one with what the server's management card (BMC) shows. Take the fan values shown on the management card as an example.

Reading the fan information from the system with ipmitool:

ipmitool sdr list | grep 'FAN[0-6]'
FAN0_F_Speed     | 4224 RPM          | ok
FAN0_R_Speed     | 3744 RPM          | ok
FAN1_F_Speed     | 4224 RPM          | ok
FAN1_R_Speed     | 3744 RPM          | ok
FAN2_F_Speed     | 4320 RPM          | ok
FAN2_R_Speed     | 3840 RPM          | ok
FAN3_F_Speed     | 4224 RPM          | ok
FAN3_R_Speed     | 3840 RPM          | ok

From this output you can read each fan's speed and status.
The difference between Dell and Inspur fan monitoring: Dell reflects an abnormal fan reading in the returned status, while Inspur only monitors whether the fan is online. Suppose a fan reads 200 RPM but is still online: Dell will mark it abnormal and raise an alarm, Inspur will not. (In practice this rarely happens; I have never seen it.)
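Because Inspur does not alarm on a low reading while the fan is still online, you may want to check the RPM yourself. A minimal sketch, assuming the `name | value | status` layout of `ipmitool sdr list` shown above; `check_fan_speed` and the 1000 RPM floor are hypothetical, so pick a threshold that suits your fans.

```shell
# check_fan_speed MIN_RPM   (hypothetical helper)
# Reads `ipmitool sdr list` output on stdin and prints every fan whose
# reading is below MIN_RPM.
check_fan_speed() {
  awk -F'|' -v min="$1" '
    $2 ~ /RPM/ {
      name = $1; gsub(/ /, "", name)   # sensor name without padding
      rpm = $2 + 0                     # " 4224 RPM " -> 4224
      if (rpm < min + 0) print name ": " rpm " RPM (below " min ")"
    }'
}

# Usage (1000 RPM is an assumed threshold):
# ipmitool sdr list | grep 'FAN[0-6]' | check_fan_speed 1000
```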

ipmitool monitoring

  • Processor status
    Here the value to check is not the "ok" status column but the last column, which should read "Presence detected".
    ipmitool sdr elist | grep -i 'cpu[0-2]_status'
    Output (matches the values on the management card):
CPU0_Status      | 7Dh | ok  |  3.0 | Presence detected
CPU1_Status      | 7Eh | ok  |  3.0 | Presence detected
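Following that advice, a check has to look at the event (last) column rather than the status column. A sketch, assuming the five-column `sdr elist` layout above (name | ID | status | entity | event); `sdr_not_present` is a hypothetical helper that prints every sensor whose event column lacks the expected text.

```shell
# sdr_not_present TEXT   (hypothetical helper)
# Reads `ipmitool sdr elist` lines on stdin and prints the name of every
# sensor whose fifth (event) column does not contain TEXT, ignoring case.
sdr_not_present() {
  awk -F'|' -v want="$1" '
    {
      name = $1; gsub(/ /, "", name)
      if (index(tolower($5), tolower(want)) == 0) print name
    }'
}

# Usage: any name printed is a CPU socket that is not reporting presence.
# ipmitool sdr elist | grep -i 'cpu[0-2]_status' | sdr_not_present 'Presence detected'
```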
  • Memory status
    Note: Inspur is rather annoying here. Its memory sensors are named after the CPU (ha, very intuitive), so check the actual sensor names before you grep. Sensor names are also inconsistent from model to model, so the monitoring has to be written to handle each model.
    The state value to check here is "Presence Detected"; rows with an empty last column report no presence, i.e. no DIMM was detected in that slot.

ipmitool sdr elist | grep -i 'CPU[0-1]_C[0-1]D[0-1]'

CPU0_C0D0        | 83h | ok  | 32.0 | Presence Detected
CPU0_C0D1        | 84h | ok  | 32.1 |
CPU0_C1D0        | 85h | ok  | 32.2 | Presence Detected
CPU0_C1D1        | 86h | ok  | 32.3 |
CPU1_C0D0        | 8Fh | ok  | 32.12 | Presence Detected
CPU1_C0D1        | 90h | ok  | 32.13 |
CPU1_C1D0        | 91h | ok  | 32.14 | Presence Detected
CPU1_C1D1        | 92h | ok  | 32.15 |
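Because empty slots are normal, requiring "Presence Detected" on every row would raise false alarms. One alternative, sketched here under the assumption that you know how many DIMMs each machine should have, is to count the populated slots; `check_dimm_count` and the expected count of 4 are hypothetical.

```shell
# check_dimm_count EXPECTED   (hypothetical helper)
# Reads the memory sensor lines on stdin, counts those reporting
# "Presence Detected", and complains if the count differs from EXPECTED.
check_dimm_count() {
  # grep -c prints 0 but exits 1 on no match; "|| true" keeps set -e happy
  present=$(grep -ci 'presence detected' || true)
  if [ "$present" -ne "$1" ]; then
    echo "DIMM count mismatch: expected $1, found $present"
    return 1
  fi
}

# Usage (this machine is assumed to have 4 DIMMs):
# ipmitool sdr elist | grep -i 'CPU[0-1]_C[0-1]D[0-1]' | check_dimm_count 4
```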
  • Drive slot status
    ipmitool sdr elist | grep -i disk
Drive slots:
DISK0_Status     | B4h | ok  |  4.0 | Drive Present
DISK1_Status     | B5h | ok  |  4.1 | Drive Present
DISK2_Status     | B6h | ok  |  4.2 | Drive Present
DISK3_Status     | B7h | ok  |  4.3 | Drive Present
DISK4_Status     | B8h | ok  |  4.4 | Drive Present
DISK5_Status     | B9h | ok  |  4.5 | Drive Present
DISK6_Status     | BAh | ok  |  4.6 | Drive Present
DISK7_Status     | BBh | ok  |  4.7 | Drive Present
DISK8_Status     | BCh | ok  |  4.8 | Drive Present
DISK9_Status     | BDh | ok  |  4.9 | Drive Present
DISK10_Status    | BEh | ok  |  4.10 | Drive Present
DISK11_Status    | BFh | ok  |  4.11 | Drive Present
DISK12_Status    | C0h | ok  |  4.12 |
DISK13_Status    | C1h | ok  |  4.13 |
DISK14_Status    | C2h | ok  |  4.14 |
DISK15_Status    | C3h | ok  |  4.15 |
DISK16_Status    | C4h | ok  |  4.16 |
DISK17_Status    | C5h | ok  |  4.17 |
DISK18_Status    | C6h | ok  |  4.18 |
DISK19_Status    | C7h | ok  |  4.19 |
DISK20_Status    | C8h | ok  |  4.20 |
DISK21_Status    | C9h | ok  |  4.21 |
DISK22_Status    | CAh | ok  |  4.22 |
DISK23_Status    | CBh | ok  |  4.23 |
DISK24_Status    | D4h | ok  |  4.24 |
Drive backplane slots:
DISK0_R_Status   | CCh | ok  |  4.0 | Drive Present
DISK1_R_Status   | CDh | ok  |  4.1 | Drive Present
DISK2_R_Status   | CEh | ok  |  4.2 |
DISK3_R_Status   | CFh | ok  |  4.3 |
DISK4_R_Status   | D0h | ok  |  4.4 |
DISK5_R_Status   | D1h | ok  |  4.5 |
DISK6_R_Status   | D2h | ok  |  4.6 |
DISK7_R_Status   | D3h | ok  |  4.7 |
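Since most slots are normally empty, a practical check is to remember which slots had a drive and report any that disappear. A minimal sketch, assuming you save a baseline once while the machine is healthy; `present_disks`, `missing_disks` and the baseline path are hypothetical.

```shell
# present_disks   (hypothetical helper)
# Prints, sorted, the sensor names whose event column says "Drive Present"
# (the DISKn_Status and DISKn_R_Status sensors alike).
present_disks() {
  awk -F'|' 'index($5, "Drive Present") { name = $1; gsub(/ /, "", name); print name }' | sort
}

# missing_disks BASELINE   (hypothetical helper)
# Reads current sensor lines on stdin and prints the slots listed in
# BASELINE that no longer report a drive (comm -23: lines only in BASELINE).
missing_disks() {
  present_disks | comm -23 "$1" -
}

# Usage (the baseline path is an assumption):
# ipmitool sdr elist | grep -i disk | present_disks > /var/tmp/disk_baseline   # once, when healthy
# ipmitool sdr elist | grep -i disk | missing_disks /var/tmp/disk_baseline     # on every check
```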
  • Power supply status
    ipmitool sdr elist | grep -i 'psu[0-1]_status'
PSU0_Status      | 74h | ok  | 10.0 | Presence detected
PSU1_Status      | 75h | ok  | 10.0 | Presence detected
  • Fan presence status
    ipmitool sdr elist | grep -i 'fan[0-9]_present'
FAN0_Present     | 60h | ok  | 29.0 | Device Present
FAN1_Present     | 61h | ok  | 29.1 | Device Present
FAN2_Present     | 62h | ok  | 29.2 | Device Present
FAN3_Present     | 63h | ok  | 29.3 | Device Present
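The power-supply and fan-presence sensors can be checked the same way. This sketch, again assuming the `sdr elist` column layout above, also flags a status column other than ok; `sdr_check` is a hypothetical helper.

```shell
# sdr_check TEXT   (hypothetical helper)
# For each `ipmitool sdr elist` line on stdin, report the sensor if its
# status column is not "ok", or if its event column does not contain TEXT.
sdr_check() {
  awk -F'|' -v want="$1" '
    {
      name = $1;   gsub(/ /, "", name)
      status = $3; gsub(/ /, "", status)
      if (status != "ok")
        print name ": status " status
      else if (index(tolower($5), tolower(want)) == 0)
        print name ": not present"
    }'
}

# Usage:
# ipmitool sdr elist | grep -i 'psu[0-1]_status' | sdr_check 'Presence detected'
# ipmitool sdr elist | grep -i 'fan[0-9]_present' | sdr_check 'Device Present'
```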
  • Temperature monitoring
    ipmitool sdr elist | grep -i temp
Inlet_Temp       | 00h | ok  | 12.0 | 22 degrees C
Outlet_Temp      | 01h | ok  | 55.1 | 32 degrees C
CPU0_Temp        | 06h | ok  |  3.0 | 28 degrees C
CPU1_Temp        | 07h | ok  |  3.0 | 26 degrees C
CPU0_DIMM_Temp   | 0Eh | ok  | 32.0 | 34 degrees C
CPU1_DIMM_Temp   | 0Fh | ok  | 32.0 | 32 degrees C
CPU0_VR_Temp     | 02h | ok  |  3.0 | 31 degrees C
CPU1_VR_Temp     | 03h | ok  |  3.1 | 30 degrees C
PCH_Temp         | 16h | ok  |  3.0 | 44 degrees C
OCP_Temp         | 29h | ns  | 11.0 | No Reading
NVME_Temp        | 28h | ns  | 11.1 | No Reading
PSU0_Temp        | 1Ch | ok  | 32.0 | 28 degrees C
PSU1_Temp        | 1Dh | ok  | 32.0 | 27 degrees C
RAID0_Temp       | 17h | ok  | 11.0 | 58 degrees C
RAID1_Temp       | 18h | ns  | 11.1 | No Reading
RAID2_Temp       | 19h | ns  | 11.2 | No Reading
RAID3_Temp       | 1Ah | ns  | 11.3 | No Reading
GPU0_Temp        | 20h | ns  | 11.0 | No Reading
GPU1_Temp        | 21h | ns  | 11.1 | No Reading
GPU2_Temp        | 22h | ns  | 11.2 | No Reading
GPU3_Temp        | 23h | ns  | 11.3 | No Reading
GPU4_Temp        | 24h | ns  | 11.4 | No Reading
GPU5_Temp        | 25h | ns  | 11.5 | No Reading
GPU6_Temp        | 26h | ns  | 11.6 | No Reading
GPU7_Temp        | 27h | ns  | 11.7 | No Reading
PCIE_SSD0_Temp   | A7h | ns  | 11.0 | No Reading
PCIE_SSD1_Temp   | A8h | ns  | 11.1 | No Reading
PCIE_SSD2_Temp   | A9h | ns  | 11.2 | No Reading
PCIE_SSD3_Temp   | AAh | ns  | 11.3 | No Reading
PCIE_SSD4_Temp   | ABh | ns  | 11.4 | No Reading
PCIE_SSD5_Temp   | ACh | ns  | 11.5 | No Reading
PCIE_SSD6_Temp   | ADh | ns  | 11.6 | No Reading
PCIE_SSD7_Temp   | AEh | ns  | 11.7 | No Reading
M.2_Inlet_Temp   | 05h | ok  | 55.0 | 28 degrees C
Rear_HDDBP_Temp  | 2Ah | ns  | 11.0 | No Reading
SWITCH0_Temp     | 4Ah | ns  | 11.0 | No Reading
SWITCH1_Temp     | 4Bh | ns  | 11.1 | No Reading
HDD_Max_Temp     | 2Bh | ok  | 11.0 | 32 degrees C
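A simple temperature check can skip the sensors that report ns / No Reading (components not fitted on this machine) and compare the rest against a limit. A sketch assuming one hypothetical limit for every sensor; in practice you would likely want per-sensor limits, since, for example, the RAID card above runs far hotter than the inlet. `check_temps` is hypothetical.

```shell
# check_temps LIMIT   (hypothetical helper)
# Reads `ipmitool sdr elist | grep -i temp` output on stdin and prints every
# sensor whose reading exceeds LIMIT degrees C. "No Reading" rows are skipped
# because they do not match /degrees C/.
check_temps() {
  awk -F'|' -v limit="$1" '
    $5 ~ /degrees C/ {
      name = $1; gsub(/ /, "", name)
      t = $5 + 0                       # " 58 degrees C" -> 58
      if (t > limit + 0) print name ": " t " C (limit " limit ")"
    }'
}

# Usage (50 C is an assumed limit):
# ipmitool sdr elist | grep -i temp | check_temps 50
```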

RAID array monitoring

For other MegaCli64 usage, look it up online.

  • Physical disk information
    sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | grep -Eiv "exit|Adapter"
Enclosure Device ID: 8 # enclosure ID
Slot Number: 13 # disk slot
Enclosure position: 0
Device Id: 14
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]  # device size
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Firmware state: Online, Spun Up # disk state; disk monitoring watches this value
SAS Address(0): 0x56c92bf001fa0bcd
Connected Port Number: 0(path0)
Inquiry Data: V6J3J9SS            HGST HUS726T4TALA6L4                    VLGAW41G
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: Unknown
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :27C (80.60 F) # temperature
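Since disk monitoring watches Firmware state, a check can scan every disk's entry for it. A sketch that treats Online and Hotspare as healthy (an assumption; adjust if you use JBOD or unconfigured-good disks) and also reports a non-zero Predictive Failure Count. `check_pd` is hypothetical, and it expects raw MegaCli output without the `#` annotations shown above.

```shell
# check_pd   (hypothetical helper)
# Reads `MegaCli64 -PDList -aAll -NoLog` output on stdin and prints each
# slot whose Firmware state is not Online/Hotspare or whose Predictive
# Failure Count is non-zero. Relies on "Slot Number" preceding both fields.
check_pd() {
  awk -F': *' '
    /^Slot Number:/ { slot = $2 }
    /^Predictive Failure Count:/ {
      if ($2 + 0 > 0) print "slot " slot ": predictive failure count " $2
    }
    /^Firmware state:/ {
      if ($2 !~ /^(Online|Hotspare)/) print "slot " slot ": " $2
    }'
}

# Usage:
# sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | check_pd
```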
  • Virtual drive information
    There may be many arrays; only one of them is shown here.
    sudo /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aAll -NoLog | grep -Eiv "exit|Adapter"
Virtual Drive: 9 (Target Id: 9)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0  # this means RAID 0
Size                : 3.637 TB
State               : Optimal # state of this array; array monitoring watches this value
Strip Size          : 64 KB  # stripe size
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Number of Spans: 1
Span: 0 - Number of PDs: 1

# The disks in this array are listed below. Note that a disk shows up in this list when it is online or a hot spare.
PD: 0 Information
Enclosure Device ID: 8
Slot Number: 10
Enclosure position: 0
Device Id: 17
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x56c92bf001fa0bca
Connected Port Number: 0(path0)
Inquiry Data: V6J3J1BS            HGST HUS726T4TALA6L4                    VLGAW41G
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: Unknown
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :28C (82.40 F)
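Since array monitoring watches the State line, a check only needs that value per virtual drive. A sketch that treats anything other than Optimal (for example Degraded) as an alert; `check_vd` is hypothetical.

```shell
# check_vd   (hypothetical helper)
# Reads `MegaCli64 -LdPdInfo -aAll -NoLog` output on stdin and prints each
# virtual drive whose State is anything other than Optimal.
check_vd() {
  awk '
    /^Virtual Drive:/ { vd = $3 }      # "Virtual Drive: 9 (Target Id: 9)" -> 9
    /^State *:/ {
      state = $0
      sub(/^State *: */, "", state)    # keep only the value
      sub(/ +$/, "", state)            # drop trailing padding
      if (state != "Optimal") print "VD " vd ": " state
    }'
}

# Usage:
# sudo /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aAll -NoLog | check_vd
```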
  • RAID card details (battery/BBU status)
    sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aAll
BBU status for Adapter: 0

BatteryType: CVPM02
Voltage: 9431 mV
Current: 0 mA
Temperature: 25 C

BBU Firmware Status:

 Charging Status              : None
 Voltage                                 : OK
 Temperature                             : OK
 Learn Cycle Requested                   : No
 Learn Cycle Active                      : No
 Learn Cycle Status                      : OK
 Learn Cycle Timeout                     : No
 I2c Errors Detected                     : No
 Battery Pack Missing                    : No
 Battery Replacement required            : No
 Remaining Capacity Low                  : No
 Periodic Learn Required                 : No
 Transparent Learn                       : No
 No space to cache offload               : No
 Pack is about to fail & should be replaced : No
 Cache Offload premium feature required  : No
 Module microcode update required        : No

Battery state:

GasGuageStatus:
 Fully Discharged        : Yes
 Fully Charged           : Yes
 Discharging             : Yes
 Initialized             : Yes
 Remaining Time Alarm    : No
 Remaining Capacity Alarm: Yes
 Discharge Terminated    : Yes
 Over Temperature        : No
 Charging Terminated     : Yes
 Over Charged            : No

 Pack energy             : 247 J
 Capacitance             : 110
 Remaining reserve space : 0


BBU Design Info for Adapter: 0

Date of Manufacture: 08/06, 2019
Design Capacity: 288 J
Design Voltage: 9500 mV
Serial Number: 1550
Manufacture Name: LSI
Device Name: CVPM02
Device Chemistry: EDLC
Battery FRU: N/A
TMM FRU: N/A
Module Version: 6635-02A


BBU Properties for Adapter: 0

Auto Learn Period: 2412000 Sec
Next Learn time: 634778466 Sec
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
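The post does not say which BBU fields to alarm on. As one possible choice, this sketch flags a few yes/no fields under BBU Firmware Status that point to a battery problem whenever they read anything other than No; the chosen field list and `check_bbu` are assumptions.

```shell
# check_bbu   (hypothetical helper; the field list is an assumption)
# Reads `MegaCli64 -AdpBbuCmd -aAll` output on stdin and prints any of the
# selected "BBU Firmware Status" flags whose value is not "No".
check_bbu() {
  awk -F' *: *' '
    {
      key = $1; sub(/^ +/, "", key)    # strip the leading indent
      val = $2; sub(/ +$/, "", val)    # strip trailing padding
    }
    key == "Battery Pack Missing" ||
    key == "Battery Replacement required" ||
    key == "Remaining Capacity Low" ||
    key == "Pack is about to fail & should be replaced" {
      if (val != "No") print key ": " val
    }'
}

# Usage:
# sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aAll | check_bbu
```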


Source: www.cnblogs.com/yanghehe/p/12301417.html