【Record】smartctl | How to check for a bad disk on Linux with smartctl, including power-on time, power cycles, and error records

  smartctl is a tool for monitoring and analyzing the health status of hard disks, and can be used to detect whether there are bad disks. Here are the steps to check disk health using smartctl:

  1. Install smartctl software
    On Linux, smartctl is part of the smartmontools package. If smartmontools is not installed yet, you can install it with one of the following commands (most NAS systems, such as TrueNAS, ship with it):

    sudo apt-get install smartmontools    # Debian/Ubuntu
    sudo yum install smartmontools        # RedHat/CentOS
    
  2. Find Disk Device Names
    Use the following command to find disk device names:

    sudo fdisk -l
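
    If fdisk prints more than you need, the whole-disk names can be filtered out of lsblk output instead. The following is only a sketch: list_disks is a helper name made up for this example, and it assumes lsblk from util-linux is available. smartctl --scan is another way to list the devices smartctl itself can detect.

```shell
# Print /dev paths of whole disks from `lsblk -d -n -o NAME,TYPE` output on stdin.
# Entries whose TYPE is not "disk" (loop devices, optical drives, ...) are skipped.
# list_disks is a hypothetical helper name, not part of any tool.
list_disks() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Typical use:
#   lsblk -d -n -o NAME,TYPE | list_disks
```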
    
  3. Run smartctl to check disk
    Run smartctl to check disk with the following command:

    sudo smartctl -a /dev/sda
    

    Replace /dev/sda with the name of the disk device you want to check. This command prints detailed information about the disk, including its identity, the overall health assessment, the SMART attribute table, the error log, and the self-test log. Note that smartctl normally needs root privileges, hence the sudo.
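
    In scripts it is often easier to look at smartctl's exit status than at its text output. According to the smartctl man page, the exit status is a bit mask. The sketch below decodes it; decode_smartctl_status is a name invented for this example, and the bit descriptions are shortened paraphrases of the man page.

```shell
# Decode the bit-mask exit status of smartctl into readable lines.
# $1 is the numeric exit status (0-255) of a previous smartctl run.
# decode_smartctl_status is a hypothetical helper name.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems flagged"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed, or a checksum error occurred" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test one bit of the status per loop iteration.
        if [ $(( (status >> bit) & 1 )) -ne 0 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Typical use:
#   sudo smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?
```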

  4. Check smartctl output
    Find information about disk health in smartctl output. Focus on the following fields:

    • SMART overall-health self-assessment test result: If the value of this field is "PASSED", the disk itself has not detected an imminent failure. If the value is "FAILED" (or anything other than "PASSED"), treat the disk as unhealthy. Note that "PASSED" alone is not a guarantee, so also check the attributes below.
    • Reallocated_Sector_Ct: This is the number of sectors that have been reallocated (remapped to spare sectors). If the raw value is not 0, the disk has already found bad sectors. A healthy disk shows a RAW_VALUE of 0, as in this example:
      ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
        5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      
    • Current_Pending_Sector: This is the number of unstable sectors currently waiting to be reallocated. If the raw value is not 0, the disk has sectors it could not read reliably, which may turn out to be bad.

  If no health issues are found in the smartctl output, then the disk should be fine. Otherwise, you may need to analyze the problem more deeply or consider replacing the disk.
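
  The two sector counters above can also be checked automatically. This is a minimal sketch that reads saved `smartctl -A` output on standard input; it assumes the usual ATA attribute table in which RAW_VALUE is the 10th column, and the function names smart_raw and check_bad_sectors are invented for this example.

```shell
# Print the RAW_VALUE (10th column) of one attribute from `smartctl -A` output on stdin.
# $1 is the attribute name, e.g. Reallocated_Sector_Ct.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Read `smartctl -A` output on stdin and report the two bad-sector counters.
# Returns 1 if either counter is non-zero, 0 otherwise.
check_bad_sectors() {
    data=$(cat)
    rc=0
    for attr in Reallocated_Sector_Ct Current_Pending_Sector; do
        raw=$(printf '%s\n' "$data" | smart_raw "$attr")
        if [ -z "$raw" ]; then
            echo "$attr: not reported by this disk"
        elif [ "$raw" -eq 0 ]; then
            echo "$attr: 0 (ok)"
        else
            echo "$attr: $raw (needs attention)"
            rc=1
        fi
    done
    return $rc
}

# Typical use:
#   sudo smartctl -A /dev/sda | check_bad_sectors
```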

  The following are some of the problems I personally encountered during the detection process.

1 When should smartctl -t be used?

  The "-t" option of the smartctl command starts a disk self-test in order to find potential problems on the disk. It takes a test type as its argument, for example "smartctl -t short /dev/sda" or "smartctl -t long /dev/sda". The self-test runs in the background inside the drive, and smartctl returns immediately with an estimate of how long the test will take. The result is not printed automatically; once the test has finished, read it from the self-test log with "smartctl -l selftest" (it is also included in the "smartctl -a" output). So if you want fresh self-test results for your disk, run smartctl with the "-t" option first.

  However, if you just want to check the current status of the disk without running a self-test, you can run "smartctl -a" directly. This command lists the current status and attributes of the disk, such as temperature, power-on hours, power cycle count, error counts, and more.

  To sum up: use "smartctl -t" when you want the drive to actively test itself, and use "smartctl -a" when you only want to read the information the drive has already recorded.
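
  After a self-test has finished, the newest entry of the self-test log tells you how it went. The sketch below checks that entry in `smartctl -l selftest` output read from standard input. It assumes the ATA log layout in which entries start with "# 1", "# 2", and so on, and in which a clean run is reported as "Completed without error"; latest_selftest_ok is a helper name made up for this example.

```shell
# Succeed (exit 0) when the newest self-test entry ("# 1") in
# `smartctl -l selftest` output on stdin reports "Completed without error".
# Fails if that entry reports anything else or if no entry is found.
latest_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Typical use:
#   sudo smartctl -t short /dev/sda        # start a short self-test in the background
#   (wait for the duration smartctl estimates)
#   sudo smartctl -l selftest /dev/sda | latest_selftest_ok && echo "last self-test passed"
```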

2 How to check the type of my disk and its supported SMART attributes?

  You can use the smartctl command to see your disk type and supported SMART attributes. Here's how to check them out:

  1. Run the "smartctl -i" command to get basic information about the disk. This command will list information such as the manufacturer, model, and serial number of the disk. For example:

    $ sudo smartctl -i /dev/sda
    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-89-generic] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Family:     Samsung based SSDs
    Device Model:     Samsung SSD 970 EVO Plus 500GB
    Serial Number:    S5H7NX0N659415D
    LU WWN Device Id: 5 002538 8b0c6d96d
    Firmware Version: 2B2QEXM7
    User Capacity:    500,107,862,016 bytes [500 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Form Factor:      M.2
    TRIM Command:     Available, deterministic, zeroed
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-4, ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Tue Feb 15 19:48:47 2022 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
  2. Look for the "Device Model" property to get the disk model. For example, in the output above, the value of the "Device Model" property is "Samsung SSD 970 EVO Plus 500GB".

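  3. Run the "smartctl -A" command to get the list of SMART attributes the disk reports. For ATA disks this prints the vendor-specific attribute table with the columns ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, and RAW_VALUE. To see which self-tests and other SMART capabilities the disk supports, use "smartctl -c".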
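
  The fields printed by "smartctl -i" in step 1 are easy to pick out in a script. The following is a small sketch that reads the command output on standard input; smart_info_field is a name invented for this example.

```shell
# Print the value of one "Name: value" field from `smartctl -i` output on stdin.
# $1 is the field name without the colon, e.g. "Device Model".
# Only the first matching line is used (some field names appear twice).
smart_info_field() {
    awk -v key="$1" 'index($0, key ":") == 1 {
        sub(/^[^:]*: */, "")   # drop the field name, the colon, and the padding
        print
        exit
    }'
}

# Typical use:
#   sudo smartctl -i /dev/sda | smart_info_field "Device Model"
```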

3 What about power-on time, power cycles, and error counts?

  Power-on hours, the power cycle count, and several error-related counters are common attributes in the output of the smartctl command. Here's how to check them:

  Run the "smartctl -a" command to get the details of the disk. In the "Vendor Specific SMART Attributes with Thresholds" part of the output, you can see all SMART attributes and their current values.

  Look for the "Power_On_Hours" attribute. It represents the cumulative time the disk has been powered on over its whole life, normally in hours. The number of times the disk has been switched on and off is a separate attribute, "Power_Cycle_Count".
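
  A large hour count is easier to judge when converted to days. The sketch below does that for `smartctl -A` output read from standard input. It assumes the raw value is a plain hour count (or at least starts with one), which is common but not guaranteed for every vendor; power_summary is a helper name invented for this example.

```shell
# Summarize Power_On_Hours and Power_Cycle_Count from `smartctl -A` output on stdin.
# Assumes RAW_VALUE is the 10th column and begins with a number.
power_summary() {
    awk '
        $2 == "Power_On_Hours"    { h = $10 + 0
                                    printf "%d hours (about %.1f days)\n", h, h / 24 }
        $2 == "Power_Cycle_Count" { printf "%d power cycles\n", $10 + 0 }
    '
}

# Typical use:
#   sudo smartctl -A /dev/sda | power_summary
```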


  Look for the error-related attributes. On ATA disks smartctl does not print a single attribute called "Error_Counters"; the error counts are spread across several attributes. For example:

  1. "Raw_Read_Error_Rate" tracks hardware read errors that occurred while reading data from the disk surface. The raw value is vendor-specific.
  2. "Seek_Error_Rate" tracks errors encountered while the heads were seeking. The raw value is also vendor-specific.
  3. "Spin_Retry_Count" records the number of retried attempts to spin the platters up to full speed.
  4. "Reallocated_Sector_Ct" records the number of bad sectors that have been reallocated.
  5. "Current_Pending_Sector" records the number of unstable sectors that are waiting to be reallocated, usually because they could not be read.
  6. "Offline_Uncorrectable" records the number of uncorrectable bad sectors found during offline scans.


  Any of these attributes may indicate a problem with the disk that needs further inspection or replacement, especially a non-zero or growing raw value for the sector-related attributes (4 to 6).

  Note that different disks may use different attribute names and counting units. When looking at SMART attributes, review the full smartctl output and look carefully at every attribute that may indicate a disk problem.

4 Don't see an Error_Counters attribute in the output?

  If you do not see an "Error_Counters" attribute after running "smartctl -a", that is expected on ATA/SATA disks, because the attribute table has no entry with that name. SCSI/SAS disks print an "Error counter log" section instead. Different disks support different SMART attributes, so some attributes may simply not exist on your disk.

  In this case, look at other parts of the "smartctl -a" output for attributes that might indicate a disk problem. For example, you can check health-related attributes such as temperature and spin-up time, where the disk reports them. You can also check the error log, which "smartctl -l error" prints on its own, for any recorded errors or warnings.

  Note that if you think there are problems with the disk but cannot confirm them with SMART attributes, there are other tools you can use for more comprehensive disk diagnostics and testing. For example, you can use the disk test tool provided by the disk manufacturer, or use a third-party tool to perform a full disk scan and test.

5 Is it normal that Seek_Error_Rate is very high?

  A large Seek_Error_Rate value in a SMART report is usually worth a closer look. This attribute tracks how often the heads fail to seek correctly and is one of the key SMART indicators. A genuinely rising error rate generally means a higher likelihood of a problem and should not be ignored.

  However, SMART attributes should be interpreted according to the disk manufacturer's conventions. Different manufacturers use different algorithms and scales to calculate attribute values, so the same raw number can mean very different things on different drives. Also, some disks correct errors automatically, which means that even if the SMART report shows some errors, the disk may still work fine.

  Therefore, it is recommended that you check the disk manufacturer's documentation for details on SMART attribute values, or you can provide the SMART report to the manufacturer for analysis for a more accurate interpretation. If you are concerned about the health of the disk, it is a good idea to back up important data and consider replacing the disk.
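
  Because raw values are vendor-specific, a more portable first check is to compare each attribute's normalized VALUE with its THRESH column: an attribute whose VALUE has dropped to or below a non-zero threshold is one the drive itself considers failed. The following is a sketch over `smartctl -A` output read from standard input; it assumes the usual ATA table layout (VALUE in column 4, THRESH in column 6), and attrs_at_or_below_threshold is a name invented for this example.

```shell
# List attributes whose normalized VALUE is at or below a non-zero THRESH.
# Reads `smartctl -A` output on stdin; header and non-table lines are skipped
# because their first column is not a numeric attribute ID.
attrs_at_or_below_threshold() {
    awk '$1 ~ /^[0-9]+$/ && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) { print $2 }'
}

# Typical use:
#   sudo smartctl -A /dev/sda | attrs_at_or_below_threshold
```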

  My drive model is ST16000NM000J-2TW103. Seagate calculates the Seek_Error_Rate value differently from other vendors, so a large raw number on this drive does not necessarily mean a fault. For details, see the article on Bilibili (station b): "Seagate mechanical hard disk read error rate and seek error rate are high? Let's see how it's calculated!"
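
  As an illustration only: a commonly reported (not officially documented) reading of the Seagate raw value is that it is a 48-bit number whose upper 16 bits count seek errors and whose lower 32 bits count total seeks. The sketch below splits a raw value under that assumption; treat the layout as unverified for your drive, and note that decode_seagate_seek_raw is a name invented for this example.

```shell
# Split a Seagate Seek_Error_Rate raw value, ASSUMING the commonly reported layout:
#   upper 16 bits = number of seek errors, lower 32 bits = total number of seeks.
# This layout is an assumption, not taken from official Seagate documentation.
# $1 is the decimal RAW_VALUE printed by smartctl.
decode_seagate_seek_raw() {
    raw=$1
    errors=$(( (raw >> 32) & 0xFFFF ))
    seeks=$(( raw & 0xFFFFFFFF ))
    echo "seek_errors=$errors total_seeks=$seeks"
}

# Typical use:
#   decode_seagate_seek_raw 690433203
```

  Under this reading, a raw value below 4294967296 (2^32) would mean zero recorded seek errors, however large the number looks.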


Origin blog.csdn.net/qq_46106285/article/details/129041960