Memory failure: CE error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x0 offset:0x0 grain:8 syndrom

Note: I am a systems engineer and not very familiar with server hardware, so I could not directly locate the faulty memory module. Normally a server reports a memory failure on its front panel, but in this case the panel showed nothing even though a module was damaged. The location only appeared, unexpectedly, while I was swapping the memory modules one by one. Before that, the server had restarted six times without displaying any useful information. A normal server panel looks like the one shown below:

 

Problem description:

During a routine system inspection, I found a large number of memory error messages in the dmesg log:

Error message:

[ 9.238875] EDAC MC0: 6 CE error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)

Breakdown of the log message:

EDAC stands for Error Detection and Correction, a mechanism built into the Linux kernel. MC is the memory controller. CE stands for correctable error, meaning an error in ECC memory that could be corrected; its counterpart is UE (uncorrectable error).
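To make the fields easier to read when there are many such lines, the message can be split with a small script. The following is a minimal sketch, not a general EDAC parser: the regular expression is written only against the single log line shown above, the function name `parse_edac_line` is hypothetical, and treating the number before `CE` as an error count is an assumption. Other kernel versions or drivers may format the line differently.

```python
import re

# Pattern matched against the one dmesg line quoted in this article.
# Assumption: the number before "CE"/"UE" is the count of errors reported.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) error on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) page:(?P<page>0x[0-9a-fA-F]+) "
    r"offset:(?P<offset>0x[0-9a-fA-F]+) grain:(?P<grain>\d+)"
)


def parse_edac_line(line):
    """Return the EDAC fields of a dmesg line as a dict, or None if it does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m["mc"]),            # memory controller index
        "count": int(m["count"]),      # assumed: number of errors in this report
        "kind": m["kind"],             # "CE" (correctable) or "UE" (uncorrectable)
        "label": m["label"],           # DIMM label as printed by the kernel
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        "grain": int(m["grain"]),
    }


sample = (
    "[ 9.238875] EDAC MC0: 6 CE error on CPU#0Channel#1_DIMM#1 "
    "(channel:1 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)"
)
print(parse_edac_line(sample))
```

The `label`, `channel` and `slot` fields are the ones that point at a physical module, so they are the most useful values to note down before opening the server.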

Symptoms:

The server boots normally and the user login screen appears, but after about ten seconds the server automatically generates a crash log and the screen goes black.

 

Troubleshooting process:

1. While the server was restarting, it reported the location of the faulty memory (the 5th memory slot of CPU No. 1).

 Note: the server had restarted six times before this message appeared.

2. Open the server cover, locate the memory module in that slot, pull it out, and replace it.

 

3. After the replacement, start the server and confirm that the system is back to normal and the error message no longer appears.
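Besides watching dmesg, one way to confirm step 3 is to read the error counters that the EDAC subsystem exposes under `/sys/devices/system/edac/mc/`. This is a minimal sketch, assuming the EDAC driver for the platform is loaded and provides the per-controller `ce_count` and `ue_count` files; the function name `read_edac_counts` is hypothetical, and the article's author did not describe using this method.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mcN directory.

    Returns an empty dict if the directory does not exist, for example when
    no EDAC driver is loaded on this machine.
    """
    counts = {}
    for mc_dir in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

The counters start again from zero after a reboot, so a `ce_count` that stays at 0 for some time after the replacement is consistent with the error having disappeared.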

 


Origin blog.csdn.net/weixin_50877409/article/details/127654062