Analyzing the Recovery Plan for an IBM x3850 RAID5 Server Failure

[General Information]
Server model: IBM x3850
Disk model: 73GB SAS hard disks
Number of disks: five in total, four forming a RAID5 array and the fifth serving as a hot spare (Hot-Spare)
OS: Red Hat Linux 5.3, hosting an OA application system built on an Oracle database

[Fault Symptoms]
Disk 3 went offline first, but the hot spare was not activated for an automatic rebuild (for reasons unknown). Disk 2 then went offline as well, and the RAID array crashed.
Oracle no longer provides support for this OA system, so the customer required data recovery plus restoration of the operating system.

[Initial Inspection Conclusion]
The hot spare was never fully activated, there is no obvious physical failure of the hard disks, and no sign that a resynchronization ran. Data in this situation is usually recoverable.

[Recovery Plan]
1. Preserve the original environment: shut down the server and make sure it is not powered on again during the recovery process.
2. Number the failed disks by slot so that every disk can later be returned to exactly the slot it came from.
3. Mount the failed disks in a read-only environment and make complete sector-level images of every one of them (refer to <How to do a complete overall disk image backup>); a minimal imaging sketch follows this list. After imaging, return the disks to their original slots; all further work is done on the images, and the original disks are not touched again until the recovered data has been confirmed correct.
4. Analyze the RAID configuration from the backup images: the original RAID level, striping rules, stripe size, parity rotation, META regions, and so on.
5. Build a virtual RAID5 environment from the information obtained (see the destriping sketch after this list).
6. Interpret the virtual disk and its file system.
7. Check whether the virtual structure is correct; if not, repeat steps 4-7.
8. Once the data is confirmed correct, extract it according to the customer's requirements. If the original disks are to be reused, first confirm that they have been completely backed up, then rebuild the RAID and migrate the data back. To migrate the operating system back, boot the failed server from a Linux LiveCD or WinPE (the latter generally lacks the needed support), or attach an extra disk carrying the recovered operating system to the failed server and copy it back at the sector level.
9. After the data handover, our data recovery center keeps the recovered data for another three days, to guard against any flaw that might have been overlooked.
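
The essence of plan item 3 is a sector-by-sector copy that records and skips unreadable sectors instead of aborting. Below is a minimal Python sketch of that behavior, assuming a Linux block device; the paths, chunk size, and fill pattern are illustrative, not the exact tool used in this case. A recognizable fill byte such as 0x55 makes damaged regions easy to spot later (compare the 555555 areas mentioned in the detailed process).

```python
import os

SECTOR = 512                 # bytes per sector
CHUNK = 2048                 # sectors per read attempt (1 MiB)
FILL = b"\x55" * SECTOR      # recognizable pattern for unreadable sectors

def image_disk(src_path, dst_path):
    """Copy src to dst sector by sector. On a read error, retry the chunk
    one sector at a time and pattern-fill sectors that still fail.
    Returns the list of unreadable sector numbers (LBAs)."""
    bad = []
    src = os.open(src_path, os.O_RDONLY)
    total = os.lseek(src, 0, os.SEEK_END) // SECTOR
    with open(dst_path, "wb") as dst:
        lba = 0
        while lba < total:
            n = min(CHUNK, total - lba)
            try:
                os.lseek(src, lba * SECTOR, os.SEEK_SET)
                dst.write(os.read(src, n * SECTOR))
            except OSError:
                for s in range(lba, lba + n):   # fall back to single sectors
                    try:
                        os.lseek(src, s * SECTOR, os.SEEK_SET)
                        dst.write(os.read(src, SECTOR))
                    except OSError:
                        bad.append(s)
                        dst.write(FILL)         # keep image offsets aligned
            lba += n
    os.close(src)
    return bad

# Example (illustrative paths):
# bad = image_disk("/dev/sdb", "/images/disk2.img")
# print(len(bad), "unreadable sectors")
```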
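
Plan items 4-6 amount to re-implementing the controller's striping in software and reading the array out of the images. The sketch below destripes a four-disk RAID5 into one flat volume image, using the geometry that step 2 of the detailed process eventually confirmed (disk order 0,1,2,3, 512-sector stripe blocks, disk 3 treated as missing). The rotation formula in parity_disk() is my assumption about what "backward parity (Adaptec)" means; in practice it is exactly the kind of parameter adjusted during the trial-and-error loop of steps 4-7.

```python
import os

SECTOR = 512
BLOCK = 512 * SECTOR         # stripe block: 512 sectors, per step 2
NDISKS = 4                   # member order 0,1,2,3; disk 3 dropped early
MISSING = 3                  # treated as absent, regenerated from parity

def parity_disk(stripe):
    """Assumed 'backward' rotation: parity starts on the last disk and
    steps toward disk 0 on each successive stripe."""
    return (NDISKS - 1) - (stripe % NDISKS)

def read_block(disks, d, stripe):
    """Read one stripe block, regenerating the missing member by XORing
    the same block of every surviving member (the RAID5 parity relation)."""
    if d != MISSING:
        disks[d].seek(stripe * BLOCK)
        return disks[d].read(BLOCK)
    acc = bytearray(BLOCK)
    for o in range(NDISKS):
        if o == MISSING:
            continue
        disks[o].seek(stripe * BLOCK)
        for i, b in enumerate(disks[o].read(BLOCK)):
            acc[i] ^= b
    return bytes(acc)

def destripe(image_paths, out_path):
    """Concatenate the data blocks of every stripe, skipping parity."""
    disks = [open(p, "rb") if p else None for p in image_paths]
    nstripes = min(os.path.getsize(p) for p in image_paths if p) // BLOCK
    with open(out_path, "wb") as out:
        for stripe in range(nstripes):
            skip = parity_disk(stripe)
            for d in range(NDISKS):
                if d != skip:               # parity carries no user data
                    out.write(read_block(disks, d, stripe))

# Disk 3's stale image is left out and regenerated (illustrative paths):
# destripe(["d0.img", "d1.img", "d2.img", None], "volume.img")
```

Verifying the result (plan item 7) is exactly what steps 3-4 of the detailed process do: open the assembled file system and decompress recent archives; any error means the geometry guess is wrong and the loop restarts.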

[Estimated Timeline]
Backup: about 2 hours
Interpreting and exporting the data: about 4 hours
Migrating the operating system back: about 4 hours

[Detailed Process]
1. Made complete images of all the original disks. Imaging found 10-20 bad sectors on disk 2; the remaining disks had no bad sectors.
2. Structure analysis produced a best-fit configuration: disk order 0,1,2,3 with disk 3 treated as missing, a stripe block size of 512 sectors, and backward parity (Adaptec). The structure is shown below:
(Figure: the analyzed RAID5 structure)
3. Verified a sample of the assembled data: over 200 MB of the most recent archives decompressed without a single error, confirming that the structure was correct.
4. Using this structure, generated the virtual RAID directly onto a single hard disk and opened the file system; there were no obvious errors.
5. After confirming that the backup images were safe, and with the customer's consent, rebuilt the RAID on the original disks, using a new disk to replace the damaged disk 2. Attached the single disk holding the recovered volume to the failed server via USB, booted the server from the Linux SystemRescueCd, and wrote everything back with the dd command.
6. After the write-back, attempted to start the operating system.
7. With all data dd'ed back, the system would not boot; the error message was: /etc/rc.d/rc.sysinit: Line 1: /sbin/pidof: Permission denied. This suggested a problem with that file's permissions.
8. Rebooted into SystemRescueCd to check: both the permissions and the size of the file were now obviously wrong; the inode was clearly damaged.
9. Re-parsed the reassembled root partition and located the inode of /sbin/pidof (the inode-location sketch after this list shows how such a record is found); the damage turned out to be caused by the bad sectors on disk 2.
10. Using the three disks 0, 1 and 3, filled disk 2's damaged region with their XOR (a sketch of this reconstruction also follows the list). After the fill, rechecked the file system: errors remained, so the inode table was examined again, and the inodes that fell inside disk 2's damaged region were found to be corrupt (the 555555 portions in the figure):
(Figure: damaged inode entries, the 555555 portions)
11. Clearly, although the UID fields of these inodes were still normal, their attributes, sizes, and initial block pointers were all wrong. After exhausting every possible line of analysis, there was no way to recover the damaged inodes in place; the only options were to repair each inode or to copy an identical file over it. For every possibly affected file, the inode's original block information was determined from the file-system journal and then corrected.
12. dd'ed the corrected root partition back once more and ran fsck -fn /dev/sda5 to check. Errors were still reported, as shown below:
(Figure: fsck error output)
13. According to the messages, fsck had found multiple inodes sharing the same data blocks. Low-level analysis showed that, because disk 3 had dropped out of the array early, old and new inode information had become intermixed.
14. After working out which file each inode belonged to and clearing the erroneous ones, ran fsck -fn /dev/sda5 again. Errors were still reported, but very few. Following the prompts, all of these inodes turned out to belong to files under a doc directory that did not affect system startup, so fsck -fy /dev/sda5 was run to force the repair.
15. After the repair, rebooted the system and reached the desktop successfully. Started the database service and launched the application: everything was normal, with no errors.
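
Step 10's XOR fill relies on the RAID5 invariant that the four blocks of any stripe XOR to zero, so a damaged sector on disk 2 equals the XOR of the same sector on disks 0, 1 and 3, whether the lost sector held data or parity. A minimal sketch, with illustrative paths and an illustrative sector range:

```python
SECTOR = 512

def xor_fill(good_paths, victim_path, first_lba, last_lba):
    """Overwrite sectors [first_lba, last_lba] of the victim image with
    the XOR of the same sectors on the surviving RAID5 members."""
    good = [open(p, "rb") for p in good_paths]
    with open(victim_path, "r+b") as victim:
        for lba in range(first_lba, last_lba + 1):
            acc = bytearray(SECTOR)
            for f in good:
                f.seek(lba * SECTOR)
                for i, b in enumerate(f.read(SECTOR)):
                    acc[i] ^= b
            victim.seek(lba * SECTOR)
            victim.write(acc)          # reconstructed sector
    for f in good:
        f.close()

# Fill disk 2's bad region from disks 0, 1 and 3 (range is illustrative):
# xor_fill(["d0.img", "d1.img", "d3.img"], "d2.img", 2_000_000, 2_000_020)
```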
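
Steps 9-11 hinge on finding the raw inode record of a damaged file such as /sbin/pidof inside the rebuilt root partition. The sketch below does that lookup on an ext2/ext3 image (RHEL 5.x roots are ext3), using the standard superblock and group-descriptor offsets; it assumes a rev-1 filesystem without flex_bg, and the inode number itself would come from ls -i or debugfs. Comparing the record at this offset against older copies of the same inode-table block preserved in the journal is the idea behind the correction in step 11.

```python
import struct

def locate_inode(img_path, ino):
    """Return (byte offset, mode, uid, size) of inode `ino` in an
    ext2/ext3 image; offsets follow the classic ext2 on-disk layout."""
    with open(img_path, "rb") as f:
        sb = f.read(2048)[1024:]                          # superblock @ 1024
        log_block_size = struct.unpack_from("<I", sb, 24)[0]
        inodes_per_group = struct.unpack_from("<I", sb, 40)[0]
        inode_size = struct.unpack_from("<H", sb, 88)[0]  # rev 1 field
        block_size = 1024 << log_block_size

        # Group descriptors (32 bytes each) start in the block after
        # the superblock.
        gd_block = 2 if block_size == 1024 else 1
        group, index = divmod(ino - 1, inodes_per_group)
        f.seek(gd_block * block_size + group * 32)
        gd = f.read(32)
        inode_table = struct.unpack_from("<I", gd, 8)[0]  # bg_inode_table

        offset = inode_table * block_size + index * inode_size
        f.seek(offset)
        raw = f.read(inode_size)
        mode, uid = struct.unpack_from("<HH", raw, 0)     # i_mode, i_uid
        size = struct.unpack_from("<I", raw, 4)[0]        # i_size
        return offset, mode, uid, size

# off, mode, uid, size = locate_inode("volume.img", 12345)  # ino via ls -i
# print(hex(off), oct(mode), uid, size)
```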

At this point, the data recovery and system migration work was complete.

Source: blog.51cto.com/sun510/2448201