NetApp FAS Controller Replacement Details & Troubleshooting

Foreword

Replacing a NetApp controller is not a routine operation. The normal replacement procedure itself is not complicated, but when problems do occur they tend to be stubborn ones. Over the past six months I have handled two controller replacements in fairly large environments and ran into almost every pitfall a controller replacement can offer. This article is a write-up of that experience, divided into three parts:

  • Part 1: the basic replacement procedure and precautions
  • Part 2: a record of a FAS8200 replacement and the troubleshooting along the way
  • Part 3: open questions and conclusions

Replacement process

Taking the FAS8200 as an example (other models are similar), the overall process is briefly described in the official documentation. Under normal circumstances, the whole procedure is not complicated:

Replace the controller module hardware - FAS8200
Link: https://docs.netapp.com/us-en/ontap-systems/fas8200/controller-replace-move-hardware.html#step-1-open-the-controller-module

First of all, note that a controller replacement normally swaps only the controller main board itself; it does not include the other hardware modules attached to it. You therefore have to manually remove the relevant modules from the old controller and install them in the new one.

Step 1: Remove the controller module
Pull the faulty controller straight out of the chassis.

Step 2: Remove the boot media
The boot media holds the ONTAP system image and is a critical module. After the replacement, make sure both controllers are running the same ONTAP version.
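
As a quick sanity check before and after moving the boot media (a minimal sketch, assuming the cluster shell is reachable on the surviving node), the running and installed ONTAP versions can be compared:

XXXX_FAS8200::> version
XXXX_FAS8200::> system image show

If the versions do not match, bring the replacement controller's boot media to the same release (for example via netboot) before attempting the giveback.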

Step 3: Remove the cache battery
The battery is in the position shown below; note that it is held in place by a retaining clip.

Step 4: Remove the memory
Reinstall the DIMMs in the new controller in the same slots and order as in the old one.

Step 5: Remove any PCIe cards (optional)
Higher-end configurations may have additional PCIe cards fitted for specific workloads, such as SAS or UTA2 cards. If present, move them to the new controller as well.

Step 6: Remove the Flash Cache module
This is the Flash Cache module that ships with the system.

Step 7: Install everything in the new controller and verify the boot
Finally, install all of the components above in the new controller in their original positions, slide it into the chassis, and wait for it to boot. During boot, confirm the System ID update so that the node and all of its disks are re-registered under the new ID, then verify the state after the system comes up. At that point the controller replacement is complete.
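
A minimal verification sketch once the node is back up (the node name is a placeholder; the actual System ID override prompt appears on the console during boot, as the case record below shows):

XXXX_FAS8200::> storage failover show
XXXX_FAS8200::> storage failover giveback -ofnode <replaced-node>
XXXX_FAS8200::> storage disk show -fields owner

Both nodes should end up in a "Connected to ..." state with no missing disks reported, and the replaced node's disks should list it as owner again.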

Of course, the above describes the happy path. In practice I have replaced roughly ten controllers in production environments, and only about half of those went smoothly. The following is a record of a recent ordeal.

Case record

Project overview: the customer replaced a controller on a FAS8200 running ONTAP 9.1P7. The entire replacement plus troubleshooting took about three hours. The steps below omit most of the troubleshooting detail (the captured log alone is close to 2 MB), and some records in the middle were lost, so treat this as a reference only.

I will not go into the hardware replacement itself. After the swap, the node booted normally and reported that the System ID had changed; I selected y to update it. Once in the system, the node sat at "Waiting for giveback", which is expected. Since the default auto-giveback delay is long, I chose to give back manually.

Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
Configuring Devices ...

CPU = 1 Processor(s) Detected.
  Intel(R) Xeon(R) CPU D-1587 @ 1.70GHz (CPU 0)
  CPUID: 0x00050664. Cores per Processor = 16
131072 MB System RAM Installed.
SATA (AHCI) Device: SV9MST6D120GLM41NP

Boot Loader version 6.0.10 
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2020 NetApp, Inc. All Rights Reserved.

Starting AUTOBOOT press Ctrl-C to abort...
Loading X86_64/freebsd/image1/kernel:0x200000/10377696 0xbe59e0/6360256 Entry at 0xffffffff80294bf0
Loading X86_64/freebsd/image1/platform.ko:0x11f7000/2513560 0x145d000/393664 0x14bd1c0/543024 
Starting program at 0xffffffff80294bf0
NetApp Data ONTAP 9.1P7
ata2: AHCI reset done: devices=00000001
Trying to mount root from msdosfs:/dev/ad4s1 [ro]...
md0 attached to /X86_64/freebsd/image1/rootfs.img
Trying to mount root from ufs:/env/md0.uzip []...
mountroot: waiting for device /env/md0.uzip ...
Copyright (C) 1992-2017 NetApp.
All rights reserved.
Writing loader environment to the boot device.
Loader environment has been saved to the boot device.
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************  
Sat Jun 13 18:50:40 2015 [nv2flash.restage.progress:NOTICE]: ReStage is not needed because the flash has no data.

WARNING: System ID mismatch. This usually occurs when replacing a boot device or NVRAM cards!
Override system ID? {y|n} y
No SVM keys found.
Firewall rules loaded.
Jun 14 02:51:07 Power outage protection flash de-staging: 17 cycles
Ipspace "ACP" created
WAFL CPLEDGER is enabled. Checklist = 0x7ff841ff
Waiting for giveback...(Press Ctrl-C to abort wait)

The manual giveback failed with the message "Partner is missing disks".

XXXX_FAS8200::> storage failover giveback -ofnode cluster2-01 

Warning: System ID changed on partner. Disk ownership will be updated with new
         system ID. Do you want to continue? {y|n}: y

Error: command failed: Failed to initiate giveback. Reason: Partner is missing
       disks. 

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    -        Waiting for giveback
cluster2-02    cluster2-01    false    System ID changed on partner (Old:
                                       xxxxxx944, New: xxxxxx887), Normal
                                       giveback not possible: partner
                                       missing file system disks
2 entries were displayed.

I manually aborted the "Waiting for giveback" state, and found that the controller itself could not see the disks:

Pausing to check HA partner status ... 
lock was released, continuing boot ...
Waiting for disk ownership to change.........................Jun 14 02:57:55 [cluster2-01:cf.disk.inventory.mismatch:error]: Status of the disk ?.? (50000398:88118750:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk. 
Jun 14 02:57:55 [cluster2-01:cf.disk.inventory.mismatch:error]: Status of the disk ?.? (50000398:880A5F74:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk. 
Jun 14 02:57:55 [cluster2-01:cf.disk.inventory.mismatch:error]: Status of the disk ?.? (50000398:88119AC4:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk. 
Jun 14 02:57:55 [cluster2-01:cf.disk.invent.mismatchalt:ALERT]: Status of some of the disks has changed or the node (cluster2-01) is missing 143 disks (detailed logs have been throttled). 
Jun 14 02:57:55 [cluster2-01:callhome.sfo.miscount:error]: Call home for HA GROUP ERROR: DISK/SHELF COUNT MISMATCH 
......Jun 14 02:58:07 [cluster2-01:raid.assim.tree.noRootVol:error]: No usable root volume was found! 
WARNING: 0 disks found!

Most of the disks appeared to be missing after the controller replacement. Checking the disk information further, their ownership still pointed at the old controller's System ID and had not been updated; trying to refresh it had no effect. (A maintenance-mode reassignment sketch follows the output below.)

XXXX_FAS8200::> storage failover show -ins
                                            Node: cluster2-02
                                    Partner Name: cluster2-01
                                   Node NVRAM ID: xxxxxx816
                                Partner NVRAM ID: xxxxxx944
                                Takeover Enabled: true
                                         HA Mode: ha
                               Takeover Possible: false                        
                    Reason Takeover not Possible: Local node is already in takeover state.
                                 Interconnect Up: true
                              Interconnect Links: RDMA Interconnect is up (Link up)
                               Interconnect Type: GOP (PLX PEX8725 NTB)
                               State Description: System ID changed on partner (Old: xxxxxx944, New: xxxxxx887), In takeover
                                   Partner State: Initializing
                             Time Until Takeover: -
         Reason Takeover not Possible by Partner: Local node is already in takeover state.
                           Auto Giveback Enabled: false
                           Check Partner Enabled: true
                  Takeover Detection Time (secs): 15
                       Takeover on Panic Enabled: true
                      Takeover on Reboot Enabled: true
               Delay Before Auto Giveback (secs): 600
                         Hardware Assist Enabled: true
                           Partner's Hwassist IP: 10.33.21.13
                         Partner's Hwassist Port: 4444
           Hwassist Health Check Interval (secs): 180
                            Hwassist Retry Count: 2                            
                                 Hwassist Status:                              
                 Time Until Auto Giveback (secs): -                            
                             Local Mailbox Disks: 2.0.0.P3, 2.0.1.P3           
                           Partner Mailbox Disks: 2.0.12.P3, 2.0.13.P3         
                     Missing Disks on Local Node: None                         
                   Missing Disks on Partner Node: 3.20.20, 3.20.11, 3.20.22, 3.20.8, 3.20.12, 3.20.18, 3.20.9, 3.20.5, 3.20.13, 3.20.17, 3.20.23, 3.20.21, 3.20.15, 3.20.19, 3.20.10, 3.20.14, 3.20.7, 3.20.0, 3.20.4, 3.20.16, 3.20.6, 3.20.2, 3.20.1, 3.20.3, 3.21.22, 3.21.21, 3.21.23, 3.21.20, 3.21.17, 3.21.18, 3.21.16, 3.21.19, 3.21.15, 3.21.12, 3.21.13, 3.21.14, 3.21.11, 3.21.10, 3.21.9, 3.21.5, 3.21.8, 3.21.6, 3.21.7, 3.21.2, 3.21.4, 3.21.1, 3.21.3, 3.21.0, 3.22.23, 3.22.22, 3.22.18, 3.22.21, 3.22.19, 3.22.20, 3.22.17, 3.22.16, 3.22.15, 3.22.14, 3.22.12, 3.22.13, 3.22.11, 3.22.10, 3.22.9, 3.22.8, 3.22.6, 3.22.7, 3.22.5, 3.22.4, 3.22.3, 3.22.2, 3.22.1, 3.22.0, 3.23.20, 3.23.23, 3.23.18, 3.23.22, 3.23.17, 3.23.21, 3.23.19, 3.23.16, 3.23.15, 3.23.7, 3.23.14, 3.23.13, 3.23.8, 3.23.12, 3.23.11, 3.23.9, 3.23.10, 3.23.6, 3.23.5, 3.23.3, 3.23.4, 3.23.2, 3.23.0, 3.23.1, 3.24.23, 3.24.22, 3.24.21, 3.24.20, 3.24.18, 3.24.19, 3.24.17, 3.24.16, 3.24.15, 3.24.14, 3.24.13, 3.24.12, 3.24.10, 3.24.11, 3.24.9, 3.24.8, 3.24.7, 3.24.6, 3.24.3, 3.24.5, 3.24.4, 3.24.1, 3.24.0, 3.24.2, 3.25.23, 3.25.22, 3.25.21, 3.25.20, 3.25.19, 3.25.18, 3.25.17, 3.25.16, 3.25.15, 3.25.14, 3.25.12, 3.25.13, 3.25.11, 3.25.10, 3.25.9, 3.25.8, 3.25.7, 3.25.5, 3.25.4, 3.25.3, 3.25.2, 3.25.1, 3.25.0
                             Time Since Takeover: 00:23:22
           Auto Giveback After Takeover On Panic: false
            Bypass Takeover Optimization Enabled: false
           Auto-giveback Override Vetoes Enabled: false
Auto Giveback Delay Before Terminating CIFS (minutes): 5
                                         HA Type: none
2 entries were displayed.   

XXXX_FAS8200::> run -node cluster2-02 -command disk show -v
  DISK       OWNER                    POOL   SERIAL NUMBER         HOME                     DR HOME                CHKSUM
------------ -------------            -----  -------------         -------------            -------------          --------
0a.00.16     cluster2-01(xxxxxx944)    Pool0  S396NX0J601836        cluster2-01(xxxxxx944)                          Block
0a.00.16P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601836NP001   cluster2-01(xxxxxx944)                          Block
0a.00.16P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601836NP002   cluster2-01(xxxxxx944)                          Block
0a.00.16P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601836NP003   cluster2-01(xxxxxx944)                          Block
0a.00.23     cluster2-01(xxxxxx944)    Pool0  S396NX0J601740        cluster2-01(xxxxxx944)                          Block
0a.00.23P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601740NP001   cluster2-01(xxxxxx944)                          Block
0a.00.23P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601740NP002   cluster2-01(xxxxxx944)                          Block
0a.00.23P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601740NP003   cluster2-01(xxxxxx944)                          Block
0a.00.2      cluster2-02(xxxxxx816)    Pool0  S396NX0J601781        cluster2-02(xxxxxx816)                          Block
0a.00.2P1    cluster2-02(xxxxxx816)    Pool0  S396NX0J601781NP001   cluster2-02(xxxxxx816)                          Block
0a.00.2P2    cluster2-02(xxxxxx816)    Pool0  S396NX0J601781NP002   cluster2-02(xxxxxx816)                          Block
0a.00.2P3    cluster2-02(xxxxxx816)    Pool0  S396NX0J601781NP003   cluster2-02(xxxxxx816)                          Block
0a.01.11     cluster2-02(xxxxxx816)    Pool0  S396NX0J601890        cluster2-02(xxxxxx816)                          Block
0a.01.11P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601890NP001   cluster2-02(xxxxxx816)                          Block
0a.01.11P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601890NP002   cluster2-02(xxxxxx816)                          Block
0a.01.11P3   cluster2-02(xxxxxx816)    Pool0  S396NX0J601890NP003   cluster2-02(xxxxxx816)                          Block
0b.20.14     cluster2-02(xxxxxx816)    Pool0  S396NX0JC34076        cluster2-01(xxxxxx944)                          Block
0b.20.16     cluster2-02(xxxxxx816)    Pool0  S396NX0K105816        cluster2-01(xxxxxx944)                          Block
0b.20.3      cluster2-02(xxxxxx816)    Pool0  S396NX0JC34082        cluster2-01(xxxxxx944)                          Block
0a.01.14     cluster2-01(xxxxxx944)    Pool0  S396NX0J601796        cluster2-01(xxxxxx944)                          Block
0a.01.14P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601796NP001   cluster2-01(xxxxxx944)                          Block
0a.01.14P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601796NP002   cluster2-01(xxxxxx944)                          Block
0a.01.14P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601796NP003   cluster2-01(xxxxxx944)                          Block
0a.00.19     cluster2-01(xxxxxx944)    Pool0  S396NX0J601931        cluster2-01(xxxxxx944)                          Block
0a.00.19P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601931NP001   cluster2-01(xxxxxx944)                          Block
0a.00.19P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601931NP002   cluster2-01(xxxxxx944)                          Block
0a.00.19P3   cluster2-01(xxxxxx944)    Pool0  S396NX0J601931NP003   cluster2-01(xxxxxx944)                          Block
0a.01.19     cluster2-01(xxxxxx944)    Pool0  S396NX0J601835        cluster2-01(xxxxxx944)                          Block
0a.01.19P1   cluster2-02(xxxxxx816)    Pool0  S396NX0J601835NP001   cluster2-01(xxxxxx944)                          Block
0a.01.19P2   cluster2-02(xxxxxx816)    Pool0  S396NX0J601835NP002   cluster2-01(xxxxxx944)                          Block
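
For reference, when disk ownership refuses to update on its own like this, the usual fallback in the NetApp replacement procedure is to reassign the disks manually from Maintenance mode on the replaced node. That is not what ultimately resolved this case, but a rough sketch looks like the following (the System IDs are the masked old/new values from the output above; verify the exact syntax for your HA state against the official guide):

LOADER> boot_ontap menu
(choose option 5, Maintenance mode boot)
*> disk show -a
*> disk reassign -s xxxxxx944 -d xxxxxx887
*> halt
LOADER> boot_ontap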

I entered the nodeshell and tried to force the giveback; no error was reported:

cluster2-02(takeover)*> cf giveback -f
System ID changed on partner. Giveback will update the ownership of partner disks with system ID: xxxxxx887.
Do you wish to continue {y|n}? y

The giveback then proceeded normally:

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    -        Waiting for cluster applications to
                                       come online on the local node
                                       Offline applications: mgmt, vldb,
                                       vifmgr, bcomd, crs, scsi blade, clam.
cluster2-02    cluster2-01    true     System ID changed on partner (Old:
                                       xxxxxx944, New: xxxxxx887),
                                       Connected to cluster2-01, Partial
                                       giveback
2 entries were displayed.

XXXX_FAS8200::> storage failover show-giveback 
               Partner
Node           Aggregate         Giveback Status
-------------- ----------------- ---------------------------------------------
cluster2-01
                                 No aggregates to give back
cluster2-02
               CFO Aggregates    Done
               aggr1_cluster2_01
                                 Not attempted yet
               aggr3_cluster2_01
                                 Not attempted yet
               aggr5_cluster_01  Not attempted yet
               aggr2_cluster2_01
                                 Not attempted yet
6 entries were displayed.
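
At this point the CFO (root) aggregates had been given back, but the SFO data aggregates were still pending. The remaining giveback is normally just retried once the blocking condition is understood (a sketch; only add -override-vetoes after confirming the veto reason is safe to ignore):

XXXX_FAS8200::> storage failover show-giveback
XXXX_FAS8200::> storage failover giveback -ofnode cluster2-01 -override-vetoes true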

After waiting a while, the cluster status showed as healthy, but the node still reported missing disks and takeover remained impossible:

XXXX_FAS8200::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
cluster2-01           true    true
cluster2-02           true    true
2 entries were displayed.

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    false    System ID changed on local (Old:
                                       xxxxxx944, New: xxxxxx887),
                                       Connected to cluster2-02, Takeover
                                       is not possible: Local node missing
                                       partner disks
cluster2-02    cluster2-01    true     System ID changed on partner (Old:
                                       xxxxxx944, New: xxxxxx887),
                                       Connected to cluster2-01, Giveback
                                       of one or more SFO aggregates failed
2 entries were displayed.

I tried the failover giveback again; the command itself executed normally, but the same errors kept appearing:

XXXX_FAS8200::> storage failover giveback -ofnode cluster2-01 

Warning: System ID changed on partner. Disk ownership will be updated with new
         system ID. Do you want to continue? {y|n}: y

Info: Run the storage failover show-giveback command to check giveback status. 

XXXX_FAS8200::> event log show
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
6/22/2023 09:23:12  cluster2-01      ALERT         cf.disk.invent.mismatchalt: Status of some of the disks has changed or the node (cluster2-01) is missing 143 disks (detailed logs have been throttled).
6/22/2023 09:23:12  cluster2-01      ERROR         cf.disk.inventory.mismatch: Status of the disk ?.? (5002538A:4812E370:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk.
6/22/2023 09:23:12  cluster2-01      ERROR         cf.disk.inventory.mismatch: Status of the disk ?.? (5002538A:4812E6E0:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk.
6/22/2023 09:23:12  cluster2-01      ERROR         cf.disk.inventory.mismatch: Status of the disk ?.? (5002538A:4812E550:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000) has recently changed or the node (cluster2-01) is missing the disk.
6/22/2023 09:23:07  cluster2-01      ERROR         scsitarget.ispfct.linkBreak: Link break detected on Fibre Channel target adapter 0e. Firmware status code status1 0x2, status2 0x7, and status4 0x0.
6/22/2023 09:22:32  cluster2-01      ERROR         asup.post.drop: AutoSupport message (HA Group Notification from cluster2-01 (REBOOT (watchdog reset)) ALERT) for host (0) was not posted to NetApp. The system will drop the message.

Checking the disk information carefully, the System ID had been updated on most of the disks, but on some it had not:

XXXX_FAS8200::> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01    cluster2-02    false    Connected to cluster2-02, Partial
                                       giveback, Takeover is not possible:
                                       Local node missing partner disks
cluster2-02    cluster2-01    true     Connected to cluster2-01. Node owns
                                       aggregates belonging to another node
                                       in the cluster.
2 entries were displayed.

After a long investigation, it finally turned out that a shelf I/O module was faulty, which was preventing the disk ownership updates (admittedly the logic here is not obvious, since the shelves are daisy-chained in a loop):

XXXX_FAS8200::*> storage shelf show
                                                                      
                                                       Module Operational
       Shelf Name Shelf ID Serial Number   Model       Type   Status
----------------- -------- --------------- ----------- ------ -----------

Warning: Unable to list entries for kernel on node "cluster2-01": RPC: Couldn't
         make connection.
              2.0        0 SHFFG1739000083 DS224-12    IOM12  Normal
              2.1        1 SHFFG1739000084 DS224-12    IOM12  Normal
             2.10       10 SHFFG1802000239 DS224-12    IOM12  Normal
             2.11       11 SHFFG1751000243 DS224-12    IOM12  Normal
             2.12       12 SHFFG1826000243 DS224-12    IOM12  Normal
             2.13       13 SHFFG1826000245 DS224-12    IOM12  Normal
             3.20       20 SHFFG1809000126 DS224-12    IOM12  Normal
             3.21       21 SHFFG1810000394 DS224-12    IOM12  Normal
             3.22       22 SHFFG1810000390 DS224-12    IOM12  Normal
             3.23       23 SHFFG1810000392 DS224-12    IOM12  Error
             3.24       24 SHFFG1810000389 DS224-12    IOM12  Normal
             3.25       25 SHFFG1810000391 DS224-12    IOM12  Normal
12 entries were displayed.
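
Shelf 3.23 reports its IOM12 module in Error. A couple of additional checks that can help confirm a bad shelf module or path (a sketch; parameter names and output detail vary by ONTAP version):

XXXX_FAS8200::*> storage shelf show -shelf 3.23
XXXX_FAS8200::*> run -node cluster2-02 -command storage show disk -p

The nodeshell storage show disk -p view lists the primary and secondary path for every disk, which makes a broken segment in the loop/stack fairly easy to spot.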

After replacing the shelf module, I manually refreshed disk ownership. The same error was still reported, but since the state now said "Node owns aggregates belonging to another node in the cluster", I decided to relocate the aggregates on controller B back to controller A.

XXXX_FAS8200::*> storage disk refresh-ownership -node cluster2-02

The aggregates were relocated from controller B back to controller A without issues:

XXXX_FAS8200::*> storage aggregate relocation start -node cluster2-02 -destination cluster2-01 -aggregate-list aggr1_cluster2_01
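
The relocation can be repeated for each remaining aggregate and monitored until everything has moved back (a short sketch):

XXXX_FAS8200::*> storage aggregate relocation show
XXXX_FAS8200::*> storage failover show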

The status finally returned to normal:

XXXX_FAS2720::*> storage failover show
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
cluster2-01     cluster2-02     true     Connected to cluster2-02
cluster2-02     cluster2-01     true     Connected to cluster2-01
2 entries were displayed.

Summary

  • NetApp's configuration follows a very strict internal logic; you need to understand the underlying mechanics to work through problems like these.
  • ONTAP versions before 9.3 really do have plenty of bugs, and I have hit quite a few recently; upgrading to 9.5 or later is recommended where possible.
  • For core storage, purchasing the original vendor warranty/support is recommended.
  • Questions about other controller-replacement issues are welcome; feel free to get in touch.
