How to monitor the startup process of the EMC VNX controller

What follows applies to essentially all EMC mid-range storage systems, including the older CLARiiON CX3 and CX4 as well as VNX1 and VNX2. Much of it also carries over to VNXe and Unity, although their operating systems changed substantially, so expect larger differences there.

EMC CLARiiON CX and VNX storage controllers can fail for many reasons. The common ones are:

1. Controller physical failure

2. The IO module of the controller is faulty

3. System Disk Vault software or hardware failure

4. Memory failure in the controller

5. Storage operating system software bug

Any of the above can take a controller down. This article is meant as a primer for less experienced engineers: not every offline controller can be fixed by replacing the controller.

Telling a physical controller failure apart from a failure of another component, or from a software fault, takes real expertise and cannot be covered in a short blog post. For a professional assessment you can reach us on WeChat: StorageExpert.

This article covers a few basic checks that field engineers can perform themselves to make a first assessment.

1. Learn to read the status lights

This is our favorite method, although it does take some expertise. Once you have learned it, you can make a basic judgment on your own. Whether the system is a CX3, CX4, VNX1 or VNX2, the controller status lights are similar. The controllers differ in physical appearance, but you can find the following lights on all of them. We use the CX or VNX5700/7500 controller as the example here; other controllers have the same three kinds of status lights.

There are three LED lights on the controller:

1) The power indicator light. Many people confuse this with the controller's fault LED. The power indicator is very simple: it stays on, in green, whenever the controller has power. If there is no power, it is off.

2) The SP fault indicator light. This light is very important; it is through this light that you can roughly judge what is wrong with a controller.

3) The small white hand light. This is a warning light. When it is on, either the controller is updating its firmware, or it is the only controller still running because the other controller is down. Either way it means: do not touch this controller.

The following focuses on what the various states of the SP fault indicator light mean. Note that this light is not static; it keeps changing as the controller boots. Sometimes the people on site are asked to check this light and they immediately send back a photo or a 3-second video, which is useless. You have to watch how the light changes over time and work out from that sequence which state the controller has reached.
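Because the information is in the blink rate rather than in a single snapshot, it helps to time the flashes. Here is a minimal sketch in Python, assuming someone on site notes the time (in seconds) of each flash. The function names and the tolerance value are illustrative choices, not part of any EMC tool, and the list of rates is simply the set of rates that appear in this article.

```python
from statistics import mean

# Blink rates mentioned in this article, as seconds between flashes.
KNOWN_RATES = {
    4.0: "once every four seconds",
    2.0: "once every two seconds",
    1.0: "once per second",
    0.25: "four times per second",
}


def blink_interval(flash_times):
    """Return the average number of seconds between consecutive flashes.

    Returns None when fewer than two flashes were recorded, because a
    single observation (or a still photo) carries no rate information.
    """
    if len(flash_times) < 2:
        return None
    gaps = [later - earlier
            for earlier, later in zip(flash_times, flash_times[1:])]
    return mean(gaps)


def describe_rate(flash_times, tolerance=0.25):
    """Map the measured interval to one of the known rates.

    tolerance is a fraction of the nominal interval (hypothetical default).
    """
    interval = blink_interval(flash_times)
    if interval is None:
        return "not enough observations"
    for seconds, label in KNOWN_RATES.items():
        if abs(interval - seconds) <= tolerance * seconds:
            return label
    return "unrecognized rate"


print(describe_rate([0.0, 4.1, 8.0, 12.05]))
```

A single timestamp yields "not enough observations", which is the same point made above about photos: one frame cannot tell you the rate.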

| LED | Color | State | Meaning |
|---|---|---|---|
| SP power | Blue | On | Power on |
| | | Off | No power |
| SP cage (enclosure) | Amber | On | Fault. It can come from anywhere in the chassis (power supply, environment, fan, I/O module, LCC card, SP, CMI, SFP, PROM, etc.). Field engineers asked for the SP status often report this light instead, so make sure you know which light is being described. |
| | | Off | Operating normally |
| SP fault LED (normal start) | Amber | On (continuous) | SP failure |
| | | Blinks once every four seconds | BIOS is executing |
| | | Blinks once every second | POST is executing |
| | | Blinks four times a second | POST is starting the OS boot |
| | Blue | Blinks once every four seconds | OS starts to boot |
| | | Blinks every 2 seconds | SEP driver starts |
| | | Blinks four times a second | SEP driver startup complete |
| | | Off | The operating system has either finished starting or has not started |
| SP fault LED (degraded start) | Amber | Blinks once every four seconds | BIOS is executing |
| | | Blinks once every second | POST is executing |
| | | Blinks four times a second | POST is starting the OS boot |
| | Blue | Blinks once every four seconds | OS starts to boot |
| | | On (stays solid blue) | Enters degraded mode |
| SP fault LED (faulty start) | Amber | On | There is a malfunction |
| | | Blinks every 2 seconds | NMI reset button pushed; blinking continues until the SP reboots and enters the power-on sequence |
| | | Blinks at 1, 3, 3 and 1 times a second | There is a memory failure |
| | Blue | On | A failure has occurred |
| SP unsafe to remove (small white hand) | White | On | The peer SP has panicked or rebooted with cache performance mode enabled; this SP is holding valid cache in memory |
| | | On | The SP is currently flashing the BIOS/POST firmware or updating the resume PROMs |
| | | On | The SP is currently dumping the cache data to the vault |
| | | Off | The SP can be safely removed for service |
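The normal-start entries of the table above can be turned into a small lookup, so that a sequence of observations reported from site becomes a boot timeline. This is only a sketch: the pattern labels such as "every 4 s" and the function name are made up for illustration, and only the normal-start states from the table are included.

```python
# (color, blink pattern) -> boot stage, copied from the normal-start entries.
NORMAL_START = {
    ("amber", "every 4 s"): "BIOS is executing",
    ("amber", "every 1 s"): "POST is executing",
    ("amber", "4 per second"): "POST is starting the OS boot",
    ("blue", "every 4 s"): "OS starts to boot",
    ("blue", "every 2 s"): "SEP driver starts",
    ("blue", "4 per second"): "SEP driver startup complete",
}


def boot_timeline(observations):
    """Translate a time-ordered list of (color, pattern) pairs into stages.

    Consecutive repeats are collapsed. Unknown pairs are kept as
    'unrecognized' entries rather than silently dropped, so nothing the
    field engineer reported is lost.
    """
    timeline = []
    for color, pattern in observations:
        stage = NORMAL_START.get((color.lower(), pattern),
                                 f"unrecognized: {color} {pattern}")
        if not timeline or timeline[-1] != stage:
            timeline.append(stage)
    return timeline


# Hypothetical observations noted by someone watching the SP fault LED.
observed = [
    ("Amber", "every 4 s"),
    ("Amber", "every 1 s"),
    ("Amber", "every 1 s"),
    ("Amber", "4 per second"),
    ("Blue", "every 4 s"),
]
timeline = boot_timeline(observed)
print("Last stage seen:", timeline[-1])
```

The last entry of the timeline is the stage the controller reached before the light stopped changing, which is the piece of information a single photo cannot provide.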

2. Monitor the process from the peer controller

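If you already know which state the controller reached during startup, or can roughly tell from the indicator lights which stage the problem is in, but still do not know the specific cause, you can log in to the healthy peer controller through RemotelyAnywhere and use the speclcli tool to monitor the startup process. This is most useful when the fault is at the operating system level, because you can see exactly which driver triggers the reboot. It is much less useful for diagnosing physical hardware: it will basically tell you that something went wrong after POST, without saying what.

RemotelyAnywhere is an essential tool for analyzing software problems, but it is also complex. Without an understanding of the VNX software architecture it is of little help, because you will not be able to interpret the results it returns.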

3. Serial cable

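This is the most effective way to diagnose a physical fault. After connecting the serial cable you must reboot the controller. Some people report that there is no output at all after plugging in the cable. For VNX that is normal: once the controller has booted it prints nothing, and output appears only during the boot process.

So after connecting the serial cable, be sure to reboot the controller to capture the boot log. From this log you can tell whether the problem lies in the first few system disks or in some piece of hardware; either way the log describes it clearly.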

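Below is an example of the format of an error reported for the Base Module. Errors on other I/O modules, on DIMM memory and so on produce similar messages, so you can see very clearly which physical component has failed.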

With the DAE added, the following errors are logged during boot:

.... Storage System Failure - Contact your Service Representative ...

ErrorCode: 0x00000907

ErrorDesc:

FRU: Base Module

Device: Base Module Card

Description: BMC indicated I/O module power disabled Error!

Rev: 40.41

Determine Module*

P/N: 303-224-000C-03

S/N: CFxxxxxxxxxxxx

EndError:

ErrorTime: 11/13/2013 23:16:24

WARNING: No SES driver GUID found: Expander

.... Storage System Failure - Contact your Service Representative ...

ErrorCode: 0x00000907

ErrorDesc:

FRU: Base Module

Device: Base Module Card

Description: BMC indicated I/O module power disabled Error!  <<<--

Rev: 40.41

Determine Module*

P/N: 303-224-000C-03

S/N: CFxxxxxxxxxxxx

EndError:

ErrorTime: 11/13/2013 23:16:24
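When a captured boot log is long, it can help to pull out just the error blocks. The following Python sketch parses text in the format shown above. The field list and the function name are assumptions based only on this one sample, so other error types may carry fields that this sketch ignores.

```python
# Fields seen in the sample error block above (assumed, not exhaustive).
FIELDS = {"ErrorCode", "FRU", "Device", "Description",
          "Rev", "P/N", "S/N", "ErrorTime"}

SAMPLE = """\
.... Storage System Failure - Contact your Service Representative ...
ErrorCode: 0x00000907
ErrorDesc:
FRU: Base Module
Device: Base Module Card
Description: BMC indicated I/O module power disabled Error!  <<<--
Rev: 40.41
P/N: 303-224-000C-03
EndError:
ErrorTime: 11/13/2013 23:16:24
WARNING: No SES driver GUID found: Expander
"""


def parse_boot_errors(text):
    """Return one dict per error block found in a serial boot log."""
    errors = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # the capture may contain blank lines between fields
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if not sep or key not in FIELDS:
            continue
        if key == "ErrorCode":
            # Every error block in the sample begins with ErrorCode.
            current = {}
            errors.append(current)
        if current is not None:
            # Drop any hand-added "<<<--" annotation at the end of a line.
            current[key] = value.split("<<<")[0].strip()
    return errors


for err in parse_boot_errors(SAMPLE):
    print(err["ErrorCode"], "|", err.get("FRU"), "|", err.get("Description"))
```

Lines that are not recognized fields, such as the WARNING line, are skipped, so the result contains only the fields that identify the failed part.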

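We will not list the output for every kind of physical fault here. Once you have the output, you can contact us to go through it together on WeChat: StorageExpert.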

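Finally, let me correct a common misconception. EMC VNX storage systems have no concept of controller synchronization. The controller holds no operating system, only firmware for some physical components, and the system upgrades or downgrades that firmware automatically to match the current OS, with no user intervention or attention required. The operating system lives entirely on the first four system disks; the controller itself holds nothing of the storage OS. So please stop saying that the controllers are "out of sync"; it is not a professional description.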


Origin blog.csdn.net/m0_72255440/article/details/131137997