QPI: Rx detected CRC error - successful LLR without Phy re-init

Just before the May 1st holiday, monitoring suddenly reported that a database host had restarted; by the time I logged in to the system, the reboot had already completed. The messages log showed the following errors:

Apr 28 13:56:33 hydb1 kernel: mce: [Hardware Error]: Machine check events logged
Apr 28 13:56:33 hydb1 kernel: mce: [Hardware Error]: Machine check events logged
Apr 28 13:56:33 hydb1 mcelog: Hardware event. This is not a software error.
Apr 28 13:56:33 hydb1 mcelog: MCE 0
Apr 28 13:56:33 hydb1 mcelog: CPU 20 BANK 21
Apr 28 13:56:33 hydb1 mcelog: MISC 1df87b000d9eff
Apr 28 13:56:33 hydb1 mcelog: TIME 1682661382 Fri Apr 28 13:56:22 2023
Apr 28 13:56:33 hydb1 mcelog: MCG status:
Apr 28 13:56:33 hydb1 mcelog: MCi status:
Apr 28 13:56:33 hydb1 mcelog: Error overflow
Apr 28 13:56:33 hydb1 mcelog: Corrected error
Apr 28 13:56:33 hydb1 mcelog: MCi_MISC register valid
Apr 28 13:56:33 hydb1 mcelog: MCA: BUS error: 2 20 Level-3 Generic Generic Other-transaction Request-did-not-timeout
Apr 28 13:56:33 hydb1 mcelog: QPI: Rx detected CRC error - successful LLR without Phy re-init
Apr 28 13:56:33 hydb1 mcelog: STATUS c80001c000310e0f MCGSTATUS 0
Apr 28 13:56:33 hydb1 mcelog: MCGCAP 7000c16 APICID 40 SOCKETID 2
Apr 28 13:56:33 hydb1 mcelog: CPUID Vendor Intel Family 6 Model 79
Apr 28 13:56:33 hydb1 mcelog: Hardware event. This is not a software error.
Apr 28 13:56:33 hydb1 mcelog: MCE 1
Apr 28 13:56:33 hydb1 mcelog: CPU 20 BANK 21
Apr 28 13:56:33 hydb1 mcelog: MISC 1df87b000d9eff
Apr 28 13:56:33 hydb1 mcelog: TIME 1682661382 Fri Apr 28 13:56:22 2023
Apr 28 13:56:33 hydb1 mcelog: MCG status:
Apr 28 13:56:33 hydb1 mcelog: MCi status:
Apr 28 13:56:33 hydb1 mcelog: Error overflow
Apr 28 13:56:33 hydb1 mcelog: Corrected error
Apr 28 13:56:33 hydb1 mcelog: MCi_MISC register valid
Apr 28 13:56:33 hydb1 mcelog: MCA: BUS error: 2 20 Level-3 Generic Generic Other-transaction Request-did-not-timeout
Apr 28 13:56:33 hydb1 mcelog: QPI: Rx detected CRC error - successful LLR without Phy re-init
Apr 28 13:56:33 hydb1 mcelog: STATUS c800008000310e0f MCGSTATUS 0
Apr 28 13:56:33 hydb1 mcelog: MCGCAP 7000c16 APICID 40 SOCKETID 2
Apr 28 13:56:33 hydb1 mcelog: CPUID Vendor Intel Family 6 Model 79
Apr 28 13:56:33 hydb1 mcelog: Hardware event. This is not a software error.
Apr 28 13:56:33 hydb1 mcelog: MCE 2
Apr 28 13:56:33 hydb1 mcelog: CPU 20 BANK 21
Apr 28 13:56:33 hydb1 mcelog: MISC 1ff87b000d9eff
Apr 28 13:56:33 hydb1 mcelog: TIME 1682661382 Fri Apr 28 13:56:22 2023
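
The STATUS values printed by mcelog can be cross-checked by hand against the flags it reports. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL bit 63, OVER bit 62, UC bit 61, MISCV bit 59, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0). The function name `decode_mci_status` is hypothetical and is not part of mcelog.

```python
def decode_mci_status(status_hex: str) -> dict:
    """Split an IA32_MCi_STATUS value (hex string) into a few common fields.

    Assumption: architectural bit positions from the Intel SDM. The corrected
    error count field is only meaningful when the CPU reports CMCI support.
    """
    status = int(status_hex, 16)
    return {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC: 0 means the error was corrected
        "misc_valid": bool((status >> 59) & 1),   # MISCV: MCi_MISC holds extra data
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,         # model-specific error code
        "mcacod": status & 0xFFFF,                # architectural MCA error code
    }


if __name__ == "__main__":
    # The two STATUS values that appear in the log above.
    for value in ("c80001c000310e0f", "c800008000310e0f"):
        fields = decode_mci_status(value)
        print(value, fields)
```

For both records this gives overflow set, uncorrected clear, and MISC valid, which matches the "Error overflow", "Corrected error", and "MCi_MISC register valid" lines in the log. The two records differ only in the corrected error count field.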

Because the hardware was still under warranty, I immediately called the vendor's 400 support hotline. They replied after the holiday that the firmware needed to be upgraded, so I scheduled downtime and performed the upgrade. I have observed the host for 4 days since the upgrade, and it is currently running normally.
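
During an observation period like this, one way to confirm that the errors have stopped is to count the mcelog records in the system log. The following is a rough sketch, assuming the records keep the `mcelog: CPU <n> BANK <n>` line format shown in the log above. The function name `count_mce_records` and the log path in the usage comment are assumptions for illustration.

```python
import re
from collections import Counter

# Matches the per-record line that mcelog writes to syslog,
# e.g. "Apr 28 13:56:33 hydb1 mcelog: CPU 20 BANK 21"
RECORD = re.compile(r"mcelog: CPU (\d+) BANK (\d+)")


def count_mce_records(lines):
    """Count mcelog records per (cpu, bank) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RECORD.search(line)
        if match:
            cpu, bank = int(match.group(1)), int(match.group(2))
            counts[(cpu, bank)] += 1
    return counts


# Usage (the path is an assumption; adjust it for your distribution):
#   with open("/var/log/messages", errors="replace") as log:
#       for (cpu, bank), n in sorted(count_mce_records(log).items()):
#           print(f"CPU {cpu} BANK {bank}: {n} record(s)")
```

An empty result for the days after the upgrade would be consistent with the fix having worked, although it only covers errors that mcelog actually logged.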

Origin blog.csdn.net/kevinyu998/article/details/130631895