EMC Data Storage Failure Analysis Report (RAID)

1. Fault Description
The user's EMC FC AX-4 storage array failed. The storage consists of 12 1TB SATA disks: 10 of them form a RAID5 array, and the remaining two serve as hot spares. Two disks in the RAID5 array failed, but only one hot spare was activated successfully, leaving the RAID5 array offline and the LUN built on it unavailable.
2. Disk Inspection
Because several disks had dropped out of the array, the entire storage became unavailable. After receiving the disks, we performed a physical inspection of every disk and found no physical faults. We then scanned every disk for bad sectors with a detection tool and found none.
3. Data Backup
For safety, and so that recovery can be attempted again if anything else goes wrong, all source disks must be backed up before any recovery work begins. We imaged every disk with WinHex. Because the source disks use 520-byte sectors, a special tool was also needed to convert every 520-byte sector of the backup images to 512 bytes.
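The 520-to-512 conversion can be sketched as follows. One assumption not stated in the report: the extra 8 bytes per sector are trailing integrity metadata (DIF/checksum-style), so keeping the first 512 bytes of each sector preserves the user data.

```python
# Sketch: convert a 520-byte-per-sector disk image to 512 bytes per sector
# by dropping the assumed 8-byte trailing metadata of every sector.

SRC_SECTOR = 520  # sector size on the EMC source disks
DST_SECTOR = 512  # sector size expected by standard analysis tools

def convert_image(src, dst):
    """Copy a disk image from file-like src to dst, keeping only the
    first 512 bytes of every 520-byte sector."""
    while True:
        sector = src.read(SRC_SECTOR)
        if not sector:
            break
        dst.write(sector[:DST_SECTOR])
```

Run against each member image in turn; the converted images can then be opened by ordinary tools that assume 512-byte sectors.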
4. Fault Analysis and Recovery
4.1 Fault analysis

Since the first two steps found neither physical faults nor bad sectors, we inferred that unstable reads and writes on some disks caused the failure. The EMC controller's disk-check policy is very strict: once a disk's performance becomes unstable, the controller considers the disk bad and kicks it out of the RAID group. Once the number of dropped disks exceeds what the RAID level can tolerate, the RAID group becomes unavailable, and so does the LUN built on it. Our preliminary understanding was that only one LUN was built on the RAID group, assigned to a SUN minicomputer, with ZFS as the upper-layer file system.
4.2 Analyzing the RAID group structure
Since EMC LUNs are built on RAID groups, we first had to analyze the underlying RAID group and reconstruct its original parameters. Examining the data on each disk, we found that disks 8 and 11 contained no data; the management interface showed that disks 8 and 11 were both hot spares, and that hot spare 8 had replaced failed disk 5. Although hot spare 8 was activated successfully, the RAID5 group was already missing one disk at that point, so no data was ever synchronized onto disk 8. We then continued analyzing the other 10 disks to determine the data distribution pattern, the RAID stripe size, and the order of the disks.
4.3 Analyzing the dropped-disk order
Using the RAID information from the analysis above, we attempted to virtually rebuild the original RAID group with North Asia's in-house RAID virtualization program. Because two disks in total had dropped out of the RAID group, we first had to determine the order in which they dropped. Careful examination of the data on each disk showed that, within the same stripe, one disk's data was clearly different from that of the others, so we initially judged that this disk had dropped first. Running North Asia's independently developed RAID parity-check program on that stripe confirmed it: the data was consistent only when that disk was excluded, which identified the first dropped disk with certainty.
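The parity check described above relies on a basic RAID5 property: in a consistent stripe, the XOR of all members (data plus parity) is zero, and any single member can be rebuilt by XOR-ing the rest. A minimal sketch (the report's actual tool is proprietary; chunk contents here are illustrative):

```python
# Sketch of a RAID5 stripe parity check. A stale member (written before the
# disk dropped) makes the stripe inconsistent; excluding it and rebuilding
# from the remaining members yields the up-to-date data.

def stripe_consistent(chunks):
    """True iff the XOR of all stripe members (data + parity) is zero."""
    acc = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            acc[i] ^= b
    return not any(acc)

def rebuild_member(chunks, missing):
    """Reconstruct one stripe member by XOR-ing all the other members."""
    acc = bytearray(len(chunks[0]))
    for i, chunk in enumerate(chunks):
        if i == missing:
            continue
        for j, b in enumerate(chunk):
            acc[j] ^= b
    return bytes(acc)
```

Checking many stripes this way, the disk whose exclusion makes the stripes consistent is the one holding stale data, i.e. the one that dropped first.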
4.4 Analyzing the LUN information in the RAID group
Since the LUN is built on the RAID group, the RAID group had to be reassembled according to the information obtained above. We then analyzed the LUN's allocation information within the RAID group and the MAP of the data blocks allocated to the LUN. Since there was only one LUN at the bottom layer, only that one LUN's information needed to be analyzed. We then fed this information into North Asia's RAID recovery (datahf.net) program, interpreted the LUN's data MAP, and exported all of the LUN's data.
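Reassembling the RAID group amounts to de-striping the member images back into one flat LUN image. A minimal sketch follows; the parity rotation shown (left-asymmetric) and the stripe-unit size are illustrative assumptions, since the real layout, unit size, and disk order must come from the analysis in 4.2:

```python
# Sketch: rebuild a flat image from RAID5 member images held in memory.
# Assumes a left-asymmetric layout: parity rotates upward from the last
# disk, and data chunks are taken in disk order, skipping parity.

def raid5_destripe(members, unit, stripes):
    """Concatenate the data chunks of `stripes` stripes from `members`
    (one bytes object per disk) into the flat LUN image."""
    n = len(members)
    out = bytearray()
    for s in range(stripes):
        parity = (n - 1) - (s % n)  # assumed parity rotation
        off = s * unit
        for d in range(n):
            if d == parity:
                continue
            out += members[d][off:off + unit]
    return bytes(out)
```

A real tool would stream from the converted image files rather than hold them in memory, and would take the dropped disk's place from the stale member rebuilt as in 4.3.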
5. ZFS File System Interpretation and Repair
5.1 Interpreting the ZFS file system

We used the ZFS file system interpreter developed by North Asia Data Recovery (datahf.net) to interpret the exported LUN, but the program reported errors while parsing certain file system metadata. We promptly arranged for the developers to debug the program and analyze the cause, and for ZFS file system engineers to check whether a version difference meant the program did not support this file system. After roughly 7 hours of analysis and debugging, we found that the sudden storage failure had damaged some of the ZFS metadata, which prevented the interpreter from parsing the file system normally.
5.2 Repairing the ZFS file system
The analysis above established that part of the ZFS metadata had been corrupted by the storage failure, so the damaged metadata had to be repaired before the file system could be parsed normally. Examination of the damaged metadata showed that the storage failed while the ZFS file system was in the middle of I/O operations, leaving some metadata only partially updated and therefore corrupt. We repaired the damaged metadata by hand, after which the ZFS file system could be parsed correctly.
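The report does not say which metadata was repaired, but a common first step when ZFS metadata is damaged mid-write is to locate the most recent valid uberblock in the vdev labels and roll back to it. A hedged sketch, assuming the standard on-disk layout (256 KiB labels, uberblock ring at 128 KiB into the label, 1 KiB slots):

```python
# Sketch: scan one ZFS vdev label for the valid uberblock with the
# highest transaction group (txg). Field offsets follow the ZFS on-disk
# format: magic, version, txg, guid_sum, timestamp, each 8 bytes.
import struct

UB_MAGIC = 0x00bab10c      # ZFS uberblock magic ("oo-ba-bloc")
LABEL_SIZE = 256 * 1024    # each vdev label is 256 KiB
UB_ARRAY_OFF = 128 * 1024  # uberblock ring starts 128 KiB into the label
UB_SLOT = 1024             # assumed minimum uberblock slot size

def best_uberblock(label):
    """Return (txg, timestamp, offset) of the newest valid uberblock
    in a 256 KiB label image, or None if no valid uberblock is found."""
    best = None
    for off in range(UB_ARRAY_OFF, LABEL_SIZE, UB_SLOT):
        slot = label[off:off + UB_SLOT]
        for fmt in ("<QQQQQ", ">QQQQQ"):  # try both endiannesses
            magic, ver, txg, guid_sum, ts = struct.unpack_from(fmt, slot)
            if magic == UB_MAGIC:
                if best is None or txg > best[0]:
                    best = (txg, ts, off)
                break
    return best
```

Whether rolling back to an earlier uberblock suffices depends on which metadata was torn; in this case the engineers repaired the damaged structures by hand.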
6. Exporting All the Data
Using the program on the repaired ZFS file system, we parsed all file nodes and the complete directory structure. A screenshot of part of the file directory is shown below:
[Screenshot: partial listing of the recovered file directory]
7. Verifying the Data
Because the data consisted of text files and DCM images, a substantial verification environment had to be built. Guided by our engineer, the user verified a sample of the data; the verification found no problems and the data was complete. Part of the file verification is shown below:
[Screenshots: file verification results]
8. Data Recovery Conclusions
Because the failure environment was well preserved on site and no risky operations were performed after the failure, the later data recovery work was greatly helped. The recovery process ran into many technical bottlenecks, all of which were solved in turn. The data recovery was completed within the expected time, the user accepted the data as correct, and the recovery work was thereby concluded.

Origin: blog.51cto.com/sun510/2401250