Data recovery method for a server RAID5 array with two disks offline (VxFS file system)

Server fault description
The customer's server holds eight 450GB SAS hard disks: seven form a RAID5 array and one is a hot spare. Two disks in the array dropped offline, which brought the RAID5 array down and made the upper-layer LUNs unusable. The disks themselves had no physical failure and no bad sectors.

Server RAID data recovery process:

1. Back up the data
Image every disk to a file with the dd command or a data recovery tool, so that the later analysis works on the image files.
[Figure 1]
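The imaging step can be sketched in Python as follows. This is only an illustration of what a dd-style copy does, not the tool used in this case: the function name `image_disk`, the 1 MiB block size, and the zero-fill policy for unreadable regions (similar in spirit to `dd conv=noerror,sync`) are all assumptions.

```python
import os

def image_disk(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path to dst_path block by block.

    Unreadable blocks are replaced with zeros so that every later offset
    in the image still lines up with the offset on the source disk.
    Returns (bytes_copied, list_of_unreadable_offsets).
    """
    bad_offsets = []
    copied = 0
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        # Seeking to the end gives the size of a regular file or a Linux block device.
        size = src.seek(0, os.SEEK_END)
        src.seek(0)
        while copied < size:
            want = min(block_size, size - copied)
            try:
                chunk = src.read(want)
            except OSError:
                # Hypothetical policy: record the offset, write zeros, skip ahead.
                bad_offsets.append(copied)
                chunk = b"\x00" * want
                src.seek(copied + want)
            if not chunk:
                break  # source ended earlier than its reported size
            dst.write(chunk)
            copied += len(chunk)
    return copied, bad_offsets
```

In this case the disks had no bad sectors, so the list of unreadable offsets would be expected to stay empty; a non-empty list would mean the image is incomplete at those offsets.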
2. Analyze the RAID group structure
All of the server's LUNs are built on the RAID group, so the underlying RAID group has to be analyzed first and then reconstructed from the data on the disks. Analysis shows that disk No. 4 is the hot spare. The distribution of Oracle database pages across the disks then yields the key parameters of the RAID group: stripe size, disk order, and data direction.
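To show how stripe size, disk order, and data direction together determine where a piece of data lives, here is a minimal sketch of a RAID5 address mapping. The article does not state which layout the array used, so the left-symmetric rotation below is an assumption chosen only because it is a common layout; the function names and the 64 KiB chunk size in the usage note are likewise hypothetical.

```python
def locate_chunk(logical_chunk, n_disks):
    """Map a logical chunk number to (row, data_disk, parity_disk).

    Assumes a left-symmetric RAID5 layout: parity starts on the last disk
    and moves one disk to the left on each row, and the data chunks of a
    row start on the disk right after the parity disk, wrapping around.
    """
    data_per_row = n_disks - 1
    row, k = divmod(logical_chunk, data_per_row)
    parity_disk = (n_disks - 1) - (row % n_disks)
    data_disk = (parity_disk + 1 + k) % n_disks
    return row, data_disk, parity_disk


def logical_to_physical(offset, n_disks, chunk_size):
    """Translate a byte offset in the virtual RAID into (disk index, byte offset on that disk)."""
    chunk, within = divmod(offset, chunk_size)
    row, data_disk, _ = locate_chunk(chunk, n_disks)
    return data_disk, row * chunk_size + within
```

With seven member disks and a hypothetical 64 KiB chunk, `logical_to_physical(6 * 65536 + 10, 7, 65536)` lands on disk index 6 at byte 65546, because the first row holds six data chunks and the seventh chunk starts the second row. The disk indices refer to positions in the recovered disk order, not to physical slot numbers.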
3. Identify the disks that dropped out of the RAID group
Using the RAID parameters obtained above, assemble the original RAID group virtually with a RAID virtualization program. Examine the data on each disk, verify the stripes with the RAID verification program developed in-house by Beiya, and determine which disk went offline first. That disk stopped receiving writes while the array kept running, so its data is stale and it is removed from the RAID group.
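The stripe check and the handling of the excluded disk both rest on the RAID5 parity property: the XOR of all member blocks in a stripe row is zero, so any one block equals the XOR of the others. The sketch below illustrates only that property. It is not Beiya's verification program, and the function names are made up for the example.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of a non-empty list of equally sized blocks."""
    size = len(blocks[0])
    acc = 0
    for block in blocks:
        if len(block) != size:
            raise ValueError("all blocks in a stripe row must have the same size")
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(size, "big")


def stripe_is_consistent(blocks):
    """True if the data blocks and the parity block of one row XOR to all zeros."""
    return not any(xor_blocks(blocks))


def rebuild_missing(surviving_blocks):
    """Recompute the block of the one excluded member from all the others."""
    return xor_blocks(surviving_blocks)
```

A row that fails `stripe_is_consistent` only says that some member is out of date; deciding which one still requires looking at the content, as described in the step above. Once the stale disk is excluded, `rebuild_missing` over the remaining six members supplies its blocks.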
4. Analyze the LUN information in the RAID group
Because the LUNs sit on top of the RAID group, first virtualize the latest state of the RAID group from the information obtained above. Then analyze how the LUNs are allocated within the RAID group and locate the data block map of each LUN. There are 6 LUNs at the bottom layer, so only the block distribution map of each of them needs to be extracted. A program written against this information parses the data maps of all LUNs and exports the data of every LUN according to its map.
[Figure 2]
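Exporting a LUN from its block map amounts to copying a list of extents, in order, out of the virtual RAID. The article does not describe the storage system's actual map format, so the sketch below assumes a hypothetical one: a sequence of `(offset, length)` pairs in bytes, listed in LUN order.

```python
def export_lun(raid, extents, out, buf_size=1024 * 1024):
    """Copy the extents of one LUN from the virtual RAID into an output file.

    raid    -- seekable binary file object representing the rebuilt RAID
    extents -- iterable of (offset, length) pairs in LUN order (hypothetical format)
    out     -- writable binary file object receiving the LUN image
    Returns the number of bytes written.
    """
    total = 0
    for offset, length in extents:
        raid.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = raid.read(min(buf_size, remaining))
            if not chunk:
                raise EOFError(f"extent at offset {offset} runs past the end of the RAID image")
            out.write(chunk)
            remaining -= len(chunk)
            total += len(chunk)
    return total
```

Running this once per map yields one image file per LUN, which is the input for the LVM analysis in the next step.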
5. Parse the LVM logical volumes
Analysis of the exported LUNs shows that every one of them contains HP-UX LVM logical volume information. Parsing the LVM information in each LUN reveals three sets of LVM. The 45GB LVM contains one LV, which holds the OA server data. The 190GB LVM contains one LV, which holds temporary backup data. The remaining 4 LUNs form an LVM of about 2.1TB with a single LV, which holds the Oracle database files. A program was written to interpret the LVM and the LVs in each set, but it reported errors.
6. Repair the LVM logical volumes
Development engineers debugged the program to locate the point of failure, while senior file system engineers examined the recovered LUNs to check whether the storage failure had damaged the LVM information. Careful inspection confirmed that it had. The damaged areas were repaired manually, the program was modified accordingly, and the LVM logical volumes were parsed again.
7. Analyze the VxFS file system
Set up an HP-UX environment, map the interpreted LV to it, and try to mount the file system. The mount fails with an error. Repairing the file system with the "fsck -F vxfs" command does not help: the file system still cannot be mounted afterwards. This suggests that some of the underlying VxFS metadata is damaged and has to be repaired manually.
8. Repair the VxFS file system
The parsed LV was checked against the underlying structure of VxFS to see whether the file system was complete. The analysis confirmed a problem in the underlying file system: I/O was in progress on it when the storage went down, so some file system metafiles were not updated and were left damaged. These metafiles were repaired manually so that the VxFS file system could be parsed normally. The repaired LV was then attached to the HP-UX machine again, and the file system mounted successfully without any error.
9. Recover all user files
With the file system mounted on the HP-UX machine, back up all user data to the designated disk space. The user data totals about 1.2TB. Screenshots of some of the file directories follow:
[Figure 3: screenshots of some of the recovered file directories]
10. Check whether the database files are complete
Checking each database file with the Oracle verification tool "dbv" found no errors. A stricter Oracle database checking tool developed in-house by Beiya then found that some database files and log files were inconsistent. Senior database engineers repaired these files and verified them again until every file passed.
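The "dbv" pass over many data files is easy to script. The sketch below only builds one command line per file, using the utility's `file`, `blocksize`, and `logfile` parameters. The `*.dbf` file pattern, the 8192-byte block size, and the helper names are assumptions for illustration and should be replaced with the values of the actual database.

```python
from pathlib import Path

def build_dbv_command(datafile, block_size=8192, logfile=None):
    """Return the argument list for one dbv run against a single data file."""
    cmd = ["dbv", f"file={datafile}", f"blocksize={block_size}"]
    if logfile is not None:
        cmd.append(f"logfile={logfile}")
    return cmd


def plan_dbv_runs(directory, pattern="*.dbf", block_size=8192):
    """Build one dbv command per matching file in a directory, in sorted order."""
    return [
        build_dbv_command(str(path), block_size)
        for path in sorted(Path(directory).glob(pattern))
    ]
```

Each returned list can be handed to `subprocess.run` on a host where Oracle's dbv is installed. As the step above notes, a clean dbv result does not rule out inconsistencies between data files and log files.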
11. Start the Oracle database
The HP-UX environment we provide did not have this version of Oracle installed, so we arranged with the user to bring the original production environment to the North Asia Data Recovery Center. The recovered Oracle database was attached to the HP-UX server of the original production environment, and the Oracle database started successfully. Some of the screenshots follow:
[Figure 4: screenshots of the Oracle database startup]
12. Data verification
Verification was carried out by the user: the Oracle database and the OA server were started, and the OA client was installed on a local laptop. The latest and historical data records were checked through the OA client, and the user arranged for staff in different departments to verify the data remotely. The verification found the data correct and complete, and the data recovery succeeded.
After the failure, the storage was left in good condition on site and no risky operations were performed on it, which greatly helped the later recovery work. Many technical obstacles came up during the process, but they were resolved one by one. The recovery was completed within the expected time, the user was satisfied with the recovered data, and all services, including the Oracle database service and the OA server, started normally.
