How do I interpret scsi status messages in RHEL like "sd 2:0:0:243: SCSI error: return code = 0x0800

Issue

 

Environment

  • Red Hat Enterprise Linux (RHEL) 2.1, 3, 4, 5, 6

 

Resolution

Each return code consists of four parts and is returned by the scsi mid-layer. Note that the return code can sometimes appear truncated, that is, it has fewer than 4 parts, because leading zeros are sometimes suppressed on output.

 

Raw

0x08  00  00  02
   D   C   B   A
   A: status_byte  = set from target device, e.g. SAM_STAT_CHECK_CONDITION
   B: msg_byte     = return status from host adapter itself, e.g. COMMAND_COMPLETE
   C: host_byte    = set by low-level driver to indicate status, e.g. DID_RESET
   D: driver_byte  = set by mid-level, e.g. DRIVER_SENSE

kernel/kernel-2.6.9/linux-2.6.9/include/scsi/scsi.h::

#define status_byte(result) (((result) >> 1) & 0x1f)   {note: see scsi.h -- this is NOT the preferred method. See [A] below}
#define msg_byte(result)    (((result) >> 8) & 0xff)
#define host_byte(result)   (((result) >> 16) & 0xff)
#define driver_byte(result) (((result) >> 24) & 0xff)

The following sections list the code values for each of the above four parts.
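
For quick triage the same shifts and masks can be applied from a shell. A minimal sketch (the decode_rc helper name is made up for illustration; the byte layout is the one shown above):

Raw

# decode_rc: split a scsi return code into its four component bytes
decode_rc() {
    local rc=$(( $1 ))
    printf 'driver_byte=0x%02x host_byte=0x%02x msg_byte=0x%02x status_byte=0x%02x\n' \
        $(( (rc >> 24) & 0xff )) $(( (rc >> 16) & 0xff )) \
        $(( (rc >>  8) & 0xff )) $((  rc        & 0xff ))
}

decode_rc 0x08000002
# driver_byte=0x08 host_byte=0x00 msg_byte=0x00 status_byte=0x02

Note the status byte is kept here as the whole low byte (the preferred SAM_STAT_* form), not the deprecated 1-bit-shifted form.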

 



[A] SCSI Status Byte


The scsi status byte is almost always set by the storage target/device itself. The one exception is the DID_ERROR status, which is set and returned by the driver when it detects some anomalous condition in the returned status. One common such condition is when the returned status from storage is success but the residual (untransferred data) byte count is non-zero: the command is saying both "it's all good" and "I didn't do everything you asked me to do" at the same time. These contradict each other, and since the driver doesn't know what to make of the information it returns DID_ERROR, a driver-internally-detected error. The cause is still the same: something at the storage target/device itself.

 

Note that a value of 00 within a byte of the return code might mean either success or that the io never got sent to storage and the error is described elsewhere in the status word. For example, a host status byte of DID_NO_CONNECT means the host had no route/path to the specified storage target and therefore was unable to send the requested command. The scsi status byte will then be the default value of 00, but that does not indicate a successful scsi command status.

 

From scsi.h:

Raw

  :
  * SCSI Architecture Model (SAM) Status codes. Taken from SAM-3 draft
  * T10/1561-D Revision 4 Draft dated 7th November 2002.
  */

#define SAM_STAT_GOOD                       0x00
#define SAM_STAT_CHECK_CONDITION            0x02
#define SAM_STAT_CONDITION_MET              0x04
#define SAM_STAT_BUSY                       0x08
#define SAM_STAT_INTERMEDIATE               0x10
#define SAM_STAT_INTERMEDIATE_CONDITION_MET 0x14
#define SAM_STAT_RESERVATION_CONFLICT       0x18
#define SAM_STAT_COMMAND_TERMINATED         0x22 /* obsolete in SAM-3 */
#define SAM_STAT_TASK_SET_FULL              0x28
#define SAM_STAT_ACA_ACTIVE                 0x30
#define SAM_STAT_TASK_ABORTED               0x40

The above are more fully defined within the SCSI standards. The following information is a synopsis of what these statuses mean:

 

Status                      Hex  Description
GOOD                        00   Target has successfully completed the command.
CHECK CONDITION             02   Indicates a contingent allegiance condition has occurred (see sense buffer for more details).
CONDITION MET               04   Requested operation is satisfied.
BUSY                        08   Indicates the target is busy. Returned whenever a target is unable to accept a command from an otherwise acceptable initiator.
INTERMEDIATE                10   Shall be returned for every successfully completed command in a series of linked commands (except the last command).
INTERMEDIATE-CONDITION MET  14   Combination of CONDITION MET and INTERMEDIATE statuses.
RESERVATION CONFLICT        18   Logical unit or an extent (portion) within the logical unit is reserved for another device.
COMMAND TERMINATED          22   Target terminates the current I/O process. This also indicates that a contingent allegiance condition has occurred.
QUEUE FULL (TASK SET FULL)  28   Shall be implemented if tagged command queuing is supported. Indicates that the target command queue is full.
ACA ACTIVE                  30   Indicates an auto contingent allegiance condition exists.
All other codes                  Reserved.

The Green code is what we're hoping for, the Gold ones are the most common non-success statuses that one might see.  The Blue ones are rarely seen. The Red ones are very uncommon and almost never are seen.

 

Note: See below from scsi.h. The status_byte() macro listed above from this same file does a 1-bit shift which, as stated in the same file, is NOT the preferred method. The SAM_STAT_* values, which use the whole byte of the scsi command status, are the preferred method within the code. However, there is still code that refers to just the embedded 5 bit field as the scsi status, which can cause confusion. See [SCSI Status Code, Sense Buffer (sense key, asc, ascq) Quick Reference](/knowledge/node/20391) for more information on the scsi status byte format.

 

From scsi.h: (the referenced SAM status codes are listed above)

Raw

 :
 * Status codes. These are deprecated as they are shifted 1 bit right
 * from those found in the SCSI standards. This causes confusion for
 * applications that are ported to several OSes. Prefer SAM Status codes
 * above.
 */

#define GOOD 0x00
#define CHECK_CONDITION 0x01
#define CONDITION_GOOD 0x02
#define BUSY 0x04

 



[B] Message byte, msg_byte()


 

From scsi.h:

 

Raw

  :
  * MESSAGE CODES
  */

#define COMMAND_COMPLETE                0x00
#define EXTENDED_MESSAGE                0x01
#define EXTENDED_MODIFY_DATA_POINTER    0x00
#define EXTENDED_SDTR                   0x01
#define EXTENDED_EXTENDED_IDENTIFY      0x02    /* SCSI-I only  */
#define EXTENDED_WDTR                   0x03
#define SAVE_POINTERS                   0x02
#define RESTORE_POINTERS                0x03
#define DISCONNECT                      0x04
#define INITIATOR_ERROR                 0x05
#define ABORT                           0x06
#define MESSAGE_REJECT                  0x07
#define NOP                             0x08
#define MSG_PARITY_ERROR                0x09
#define LINKED_CMD_COMPLETE             0x0a
#define LINKED_FLG_CMD_COMPLETE         0x0b
#define BUS_DEVICE_RESET                0x0c

#define INITIATE_RECOVERY               0x0f    /* SCSI-II only */
#define RELEASE_RECOVERY                0x10    /* SCSI-II only */
#define SIMPLE_QUEUE_TAG                0x20
#define HEAD_OF_QUEUE_TAG               0x21
#define ORDERED_QUEUE_TAG               0x22

 



[C] Host byte, host_byte()


 

From scsi.h:

 

Raw

  :
  * Host byte codes 
  */
 ____RHEL____
 5    6   7           Name                   Value     Description
 x    x   x   #define DID_OK                  0x00  /* NO error                                */
 x    x   x   #define DID_NO_CONNECT          0x01  /* Couldn't connect before timeout period  */
 x    x   x   #define DID_BUS_BUSY            0x02  /* BUS stayed busy through time out period */
 x    x   x   #define DID_TIME_OUT            0x03  /* TIMED OUT for other reason              */
 x    x   x   #define DID_BAD_TARGET          0x04  /* BAD target.                             */
 x    x   x   #define DID_ABORT               0x05  /* Told to abort for some other reason     */
 x    x   x   #define DID_PARITY              0x06  /* Parity error                            */
 x    x   x   #define DID_ERROR               0x07  /* Internal error                          */
 x    x   x   #define DID_RESET               0x08  /* Reset by somebody.                      */
 x    x   x   #define DID_BAD_INTR            0x09  /* Got an interrupt we weren't expecting.  */
 x    x   x   #define DID_PASSTHROUGH         0x0a  /* Force command past mid-layer            */
 x    x   x   #define DID_SOFT_ERROR          0x0b  /* The low level driver just wish a retry  */
 x    x   x   #define DID_IMM_RETRY           0x0c  /* Retry without decrementing retry count  */
 x    x   x   #define DID_REQUEUE             0x0d  /* Requeue command (no immediate retry) also
                                                     * without decrementing the retry count    */
 +[1]  x   x   #define DID_TRANSPORT_DISRUPTED 0x0e  /* Transport error disrupted execution
                                                     * and the driver blocked the port to
                                                     * recover the link. Transport class will
                                                     * retry or fail IO */
 +[1]  x   x   #define DID_TRANSPORT_FAILFAST  0x0f /* Transport class fastfailed the io        */
 +[2]  x   x   #define DID_TARGET_FAILURE      0x10 /* Permanent target failure, do not retry on
                                                    * other paths                              */
 +[2]  x   x   #define DID_NEXUS_FAILURE       0x11 /* Permanent nexus failure, retry on other
                                                    * paths might yield different results      */
 -    +[3] x   #define DID_ALLOC_FAILURE       0x12 /* Space allocation on device failed        */
 -    +[3] x   #define DID_MEDIUM_FAILURE      0x13 /* Medium error                             */

Notes:
[1]RHEL 5: Added in RHEL 5.4 and later.
[2]RHEL 5: Added in RHEL 5.8 and later.
[3]RHEL 6: Added in RHEL 6.6 and later.

 

 



[D] Driver byte, driver_byte()


 

From scsi.h:

 

Raw

:
#define DRIVER_OK           0x00    /* Driver status                        */

/*
 *  These indicate the error that occurred, and what is available
 */

#define DRIVER_BUSY         0x01
#define DRIVER_SOFT         0x02
#define DRIVER_MEDIA        0x03
#define DRIVER_ERROR        0x04
#define DRIVER_INVALID      0x05
#define DRIVER_TIMEOUT      0x06
#define DRIVER_HARD         0x07
#define DRIVER_SENSE        0x08    << sense buffer available from target about event
#define SUGGEST_RETRY       0x10
#define SUGGEST_ABORT       0x20
#define SUGGEST_REMAP       0x30
#define SUGGEST_DIE         0x40
#define SUGGEST_SENSE       0x80
#define SUGGEST_IS_OK       0xff
#define DRIVER_MASK         0x0f
#define SUGGEST_MASK        0xf0

Return Code Information


00.00.00.18 RESERVATION CONFLICT


Raw


0x00.00.00.18
           18   status byte : SAM_STAT_RESERVATION_CONFLICT - device reserved to another HBA, 
                                                              command failed
        00         msg byte : <{likely} not valid, see other fields>
     00           host byte : <{likely} not valid, see other fields>
  00            driver byte : <{likely} not valid, see other fields>



===NOTES===================================================================================

    o 00.00.00.18 RESERVATION CONFLICT
        . this is a scsi status returned from the target device

Example:

-------------------------------------------------------------------------------------------

reservation conflict and Error Code: 0x00000018  in messages for the lun
   . two types of reservations
        . reserve/release (typically tapes, not disks, exclusive between device and one hba)
        . persistent reservations (survives storage power cycle) (typically cluster fencing
          method for disks) shared reservation across multiple initiators
   . see 

How can I view, create, and remove SCSI reservations and keys?

to see if the lun is reserved by another host. If there are a lot of "reservation conflict" messages in dmesg, that normally means that the scsi device the system tried to access is reserved by another node and cannot be accessed at that time. To resolve the problem, verify that the device access configuration of your application is correct, and contact your application vendor for further assistance. Also see

Why I got a lot of "reservation conflict" on dmesg when system bootup?

----------------------------------------------------------------------------------------------

Notes: had a case on a vm guest with reservation conflicts.

$ sg_persist --in -k -d /dev/sdc
$ sg_persist --in -r -d /dev/sdc

showed a clean device without PR reservations of any kind at the guest. An sg_tur probably would have helped to detect if reserve/release was being used on these devices. But also, with virt layers it's possible they were injecting the reservation conflict, especially since the return code was very odd: return code = 0x00110018. The 0x11 host byte is undefined in RHEL prior to RHEL 5.8 (this was 5.6)... probably from the 3rd party hypervisor, so the 3rd party hypervisor vendor should be contacted.

 
 

 


00.01.00.00 DID_NO_CONNECT


Raw


0x00 01 00 00
           00   status byte : {likely} not valid, see other fields
        00         msg byte : {likely} not valid, see other fields
     01           host byte : DID_NO_CONNECT - couldn't connect before timeout period 
                                               {possibly device doesn't exist}
  00            driver byte : {likely} not valid, see other fields

=== NOTES ===================================================================================

        + often means device is no longer accessible,
        + sometimes issued after burning through all retries
        + other times issued immediately if the target device is no longer connected
          to the san (for example, storage port disconnected from san)

Example:
kernel: qla2xxx 0000:0d:00.1: LOOP DOWN detected (2 5 0)
kernel: sd 0:0:0:1: SCSI error: return code = 0x00010000

Summary:
IO could not be issued as there is no connection to the device.

Description:
The IO command is being rejected with an error status of DID_NO_CONNECT. Either access to the
device is temporarily unavailable (as in this example, a LOOP or LINK DOWN condition has
occurred - we don't know when the connection will return), or the device is no longer available
within the configuration. This might happen if the storage is reconfigured to remove that lun
from being exported to this host for example. This status is not immediately returned, but is
returned after the timeout period has expired. In other words, the io is queued up and ready to
go but has no place to be sent.

More/Next Steps:
Review the messages files to see if there is an explanation of why the device isn't available.
LOOP and LINK DOWN events are explicit; RSCN processing that removes a lun or nport is less so. If
the DID_NO_CONNECT is being reported across multiple luns on an HBA, then likely there is an
unexpected storage related issue present. If the status is only being returned against one lun
and if there are multiple paths to this lun, across multiple HBAs then likely the lun was
removed from the configuration. This has inadvertently occurred on shared storage in the past.

Additional information that is important includes: is this a permanent condition or does the
device return at some point? Does a reboot bring the device back? Are multiple devices
involved? Multiple HBAs? Multiple systems (that share the storage)? Is this a one-off event or
burst, is it continuous, or it is happening sporadically over time?

Developing a storage diagram of what systems are attached to which nports on the storage
controller and which luns are being exported to where can be useful in understanding the
complexity of the storage configuration. 

In some cases, actions being performed within the storage controller can cause this type of
behavior. Were there any maintenance or configuration changes being made at the time of the
events? There are ways of tuning the system to be more tolerant of storage controllers going
away suddenly, such that the kernel is able to wait longer periods of time and ride through SAN
or controller perturbations.

Additional steps that could be taken would be to

1. Have the customer engage their storage vendor if there is no explanation and you are 
   unable to recover access to the device.
2. Turn on additional kernel messages such as additional scsi or driver extended logging, and/or
3. Look into changing multipath and driver parameters (see the sketch after this list), such as
   - lpfc_devloss_tmo (earlier lpfc_nodev_tmo), increase the timeout the lpfc driver waits for a
     device to return
   - no_path_retry, increase the number of retries that multipath waits for the path to return,
     for example
   - baseline loading information may need to be gathered (`iostat`, `vmstat`, `top`, etc.)
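
A minimal sketch of inspecting and adjusting those parameters (the rport path and values are examples only; adjust for the actual configuration):

Raw

# current FC remote-port device-loss timeout (seconds) for one example rport
cat /sys/class/fc_remote_ports/rport-0:0-1/dev_loss_tmo
# lengthen it so short link outages do not immediately fail the path
echo 60 > /sys/class/fc_remote_ports/rport-0:0-1/dev_loss_tmo

# confirm what multipath is currently using for no_path_retry
multipathd -k"show config" | grep no_path_retry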

Recommendations:
Request step 1 and get the vendor case number posted within the ticket. We'll need this if/when
engaging the vendor from our side of things -- if it comes to that.

Recommend step 2 to be implemented at customer discretion. It's often fairly lightweight in
terms of system impact depending on what flags are set.

Request step 3 only if the storage vendor doesn't find anything and is looking for additional
information/cooperation on the issue. Sometimes issues are related to loading. That shouldn't
be the case for DID_NO_CONNECT, but if other avenues have been exhausted then this is something
to try.


 
 

 


00.02.00.00 DID_BUS_BUSY


Raw


0x00 02 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     02           host byte : DID_BUS_BUSY - bus stayed busy throughout the timeout period
                                             {transport failure likely}
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

        + Parallel SCSI buses: there is an actual bus busy signal/line on the bus. 
          While the bus is busy, the adapter cannot arbitrate for the bus
          to send out the next command. If the bus stays busy too long, this error
          results. For example, some older tapes were guilty of holding onto the bus 
          during long multi-megabyte transfers so as to not lose streaming and 
          so while backups were running this behavior would result. Typically you 
          shouldn't be seeing this type of behavior though.

        + usually a retriable event, expectation is "bus" will become unbusy at some point 
          where "bus" means transport to device and not the device itself 

        + *if* the bus busy is associated with RSCN message processing (look for information
          in messages around the same time), then the bus busy condition may be related to 
          RSCN processing. The `lpfc` and `qla2xxx` drivers have a configuration option to 
          change the processing of RSCN messages which can, in some cases, prevent timeouts 
          and associated bus busy reporting. The options would be added to `/etc/modprobe.conf`
          and a new `initrd` created, plus a reboot, in order to put these new options into
          effect (see the sketch following these notes).

          Note that these options may not be available on older distributions so verify that  
          they are available within the kernel and that the problem seems related to RSCN
          processing. The driver post 4.4 release should contain this option plus the enhanced
          RSCN processing code.  See BZ 213921 for more information

          options lpfc lpfc_use_adisc=1
          options qla2xxx ql2xprocessrscn=1

        + Some common SAN issues that could contribute to getting this status are link down or 
          still being recovered status, the port status change processing (for example the above
          RSCN processing), low level FC buffer protocol issues (not enough buffer credits to  
          exchange frames), etc.  The bus busy from recovering link or port processing is a 
          result of said processing taking longer than usual; the secondary result is that
          commands trying to be sent are blocked/stalled too long, resulting in "bus busy"
          status.  That is, "I couldn't send this command out because the bus was unavailable
          for too long a period of time".
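
A minimal sketch of applying one of those options on a RHEL 5 style system (the initrd filename shown is the usual default and may differ):

Raw

# add the option, rebuild the initial ramdisk, then reboot for it to take effect
echo "options lpfc lpfc_use_adisc=1" >> /etc/modprobe.conf
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)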


Example:
kernel: SCSI error : <2 0 0 1> return code = 0x20000
kernel: Buffer I/O error on device sdd, logical block 3

Summary:
A bus busy condition is being returned back from the hardware which prevents the command from
being processed.

Description:
In the majority of cases there is some underlying SAN/storage issue causing this either 
directly or indirectly.

More/Next Steps:
1. review messages file for other events
2. collect iostat/vmstat/blktrace from the system, and compare/contrast io rates while the problem
   isn't being reported vs when it is (a collection sketch follows this list).  This only works
   well if the collected data is from all boxes connected to the shared storage ports.
3. have san storage support look at switch and hba statistics to see if any counters are incrementing
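
A minimal sketch of the collection in step 2 (sysstat/procps tools; the interval and output paths are arbitrary choices, and the same collection should be run on every box sharing the storage ports):

Raw

# sample extended device stats and memory/cpu stats every 5 seconds until stopped
iostat -dmxt 5 > /var/tmp/iostat.$(hostname).out &
vmstat 5       > /var/tmp/vmstat.$(hostname).out &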

Recommendations:
Gather iostats information to see if the problem is load induced or related.
Engage san/storage support group to review san.

 


00.02.00.08 DID_BUS_BUSY + {SCSI} BUSY


Raw


0x00 02 00 08
           08   status byte : SAM_STAT_BUSY - device {returned} busy {status}
        00         msg byte : {likely} not valid, see other fields;
     02           host byte : DID_BUS_BUSY - bus stayed busy throughout the timeout period
                                             {transport failure likely}
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================


        + since the scsi command status returned from the target is BUSY status, this is 
          truly a device-is-busy problem and not a transport issue.  Sometimes a device goes 
          busy when the storage controller is reset, or if the device itself is undergoing 
          an internal reset and has not finished yet, or a raid volume rebuild is in progress
          and access is either blocked or slow.  Usually retries will ride through 
          these busy times and eventually complete ok.  If not, then the target device/storage 
          controller should be reviewed as to why this issue is occurring.

        + exception: the lpfc driver sets this particular status when the nport is not in the 
          expected NLP_STE_MAPPED_NODE state but does exist.  Still a storage side issue, but 
          setting lpfc logging for ELS and FCP would gather more information on when the port 
          changed state (ELS) and what the response packet/command back from the adapter had 
          for additional information (FCP).  

The SAM_STAT_BUSY is the scsi status code back from the target (disk, device).

SCSI Status
Hex     Description
08      BUSY               Indicates the target is busy. Returned whenever a target is 
                           unable to accept a command from an otherwise acceptable initiator.

Example:
Aug 27 08:36:49 hostname kernel: lpfc 0000:07:00.0: 0:1305 Link Down Event x6 received Data: x6 x20 x80110
Aug 27 08:36:49 hostname kernel: sd 1:0:0:400: SCSI error: return code = 0x00020008
Aug 27 08:36:49 hostname kernel: end_request: I/O error, dev sdk, sector 3267384
Aug 27 08:36:49 hostname kernel: device-mapper: multipath: Failing path 8:160.
Aug 27 08:36:49 hostname multipathd: 8:160: mark as failed

Summary:

Description:

More/Next Steps:
1. If lpfc, setting additional logging via echo 4115 > /sys/class/scsi_host/hostN/lpfc_log_verbose
   might be useful.
   4115 = 0x1013 ; LOG_ELS, LOG_DISCOVERY, LOG_LINK_EVENT, LOG_FCP_ERROR


Recommendations:

 


00.04.00.00 BAD_TARGET


Raw


0x00 04 00 00
              00   status byte : {likely} not valid, see other fields;
           00         msg byte : {likely} not valid, see other fields;
        04           host byte : DID_BAD_TARGET - bad target
     00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

        + Something about the behavior of this target is being flagged; need to 
          look into the driver, messages, and other data to determine specifically 
          why it's being flagged as a bad target.

        + This is a software detected hardware problem (data provided back from device
          is out of range, has conflicting information, or fails sanity checks)

        + For different drivers it can mean different things. For example:
          - iscsi, sets this status if 
              . it receives a scsi CHECK CONDITION status back from the device but 
                with an invalid sense buffer size (the device is saying check the sense data 
                and then not providing proper sense data to look at)
              . it receives a scsi UNDERRUN or OVERRUN condition but the residual 
                byte count does not match expectations
          - megaraid, sets this if 
              . the lun number is too high (lun number higher than supported by driver)
          - libata, sets this if 
              . device no longer present (as if it got caught in hot-plug device removal)

Example:
kernel: sd 0:0:1:0: SCSI error: return code = 0x00040000
kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: raid1: Disk failure on sda3, disabling device.


Summary:
Target device is not usable.

Description:
This may be transitory, but if permanent there are usually other events logged in messages at/near
the same timeframe. The target device is no longer available to send commands to.

More/Next Steps:

Recommendations:
Engage storage h/w support, likely storage side issue present.


 


00.07.00.00 DID_ERROR - driver internal error


 

Raw


0x00 07 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     07           host byte : DID_ERROR - internal error
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================


      + DID_ERROR - driver detected an internal error condition within the response data
        returned from storage hardware.

      + DID_ERROR is assigned within the driver when the returned data from the HBA
        doesn't make sense (like SUCCESS scsi status on io command, but residual byte count
        returned indicates not all requested data was transferred).

      + DID_ERRORs are often akin to software detecting that some type of hardware error is
        present.

      + How and where a specific driver detects such anomalies is driver type and version
        dependent.

      + Most places within the driver that set DID_ERROR are also covered by extended event
        logging, so turning on the driver's additional logging often will provide additional
        information as to specific cause.
          o also, for FC HBAs monitor the FC port statistics, if available.  See
            

"What do the fc statistics under /sys/class/fc_host/hostN/statistics mean"

      + generally with FC drivers/hbas, DID_ERROR is assigned when a hardware or san-based issue
        is present within the storage subsystem such that the fibre channel response frames
        received back from the hba at command completion time contain invalid or conflicting
        information in some way. Some common examples of why DID_ERROR is returned:
          o The FC response frame indicates the response length field is valid, but this means,
            by FC specification, that the length must be 0, 4, or 8 -- and the length field is
            not any of these allowed sizes.
          o The scsi protocol data includes a sense buffer and indicates the whole of the
            included scsi data is N bytes, but the FC "wrapper" indicates that it is carrying
            only X bytes of encapsulated protocol data where X < N. For example, the scsi data
            might provide a sense buffer length of 24 bytes, but the fibre channel frame
            indicates it is carrying only 8 bytes of total scsi data -- essentially an
            impossible condition for the driver to reconcile.
          o Other cases are similar:
            - a data overrun where there shouldn't be,
            - a data underrun but the hba's count of frames and data is different from the
              information within the response frames,
            - an invalid or unexpected status returned by the lpfc's firmware,
            - a queue full condition, but underrun and residual byte counts weren't updated or
              set by the storage controller so they don't agree,
            - an underrun condition specified in status from storage but the residual byte count
              does not match up with the transferred byte count within the HBA's firmware,
            - an underrun condition detected due to a dropped frame (storage returned success,
              but the host didn't receive all the data transmitted to it by storage),
            - an overrun condition detected, the HBA's transferred byte count exceeds the count
              within the storage FC response frame,
            - firmware detected invalid fields within the FC response frame from storage, such
              as incorrect io entry order, count, parameter, and/or unknown status type,
            - the driver received information from the HBA firmware that doesn't match a known
              valid response.

      + Upon completion of a scsi command, storage communicates the command status back to the
        host. In a fibre channel environment this is done by sending back an FC response frame.
        The frame consists of two parts -- an FC "wrapper" containing protocol-neutral
        information, and a payload within the frame which contains SCSI protocol specific data.
        The scsi data includes scsi status, sense buffer (if available), etc.
          o A common cause of DID_ERRORs is when the information within the FC "wrapper"
            conflicts with the information within the scsi status/sense information.
          o Need to check the specific driver source code as to why a specific instance of
            DID_ERROR is being returned.
          o To determine where within the driver the specific instance of DID_ERROR is
            occurring, driver extended event logging will need to be enabled.
Example:
Oct 28 13:33:24 hostname kernel: sd 2:0:0:2: SCSI error: return code = 0x00070000
Oct 28 13:33:24 hostname kernel: end_request: I/O error, dev sdc, sector 1010494514
Oct 28 13:47:23 hostname kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
Oct 28 13:47:23 hostname kernel: end_request: I/O error, dev sda, sector 7831441538
Oct 28 13:47:56 hostname kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
Oct 28 13:47:56 hostname kernel: end_request: I/O error, dev sda, sector 9141689458
Oct 28 13:48:06 hostname kernel: sd 2:0:0:2: SCSI error: return code = 0x00070000
Oct 28 13:48:06 hostname kernel: end_request: I/O error, dev sdc, sector 6798352378
Oct 28 13:48:34 hostname kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
Oct 28 13:48:34 hostname kernel: end_request: I/O error, dev sda, sector 4050283778

Summary:
The driver detected an anomalous condition within the returned completion information from
storage.

Description:
The driver is detecting an error condition with the io completion information and setting
DID_ERROR to signify that the io completion is suspect. Typically a DID_ERROR indicates a
h/w (hba/cable/switch/storage) side issue.

More/Next Steps:
Review the messages. Is the problem only occurring on specific scsi devices (all 0:0:*:* would
indicate just one scsi host, but if it's happening on 0:0:*:* and say 1:0:*:* then it's
happening on multiple scsi hosts)? How frequently? Ultimately this is a h/w side issue, but
which devices or device sets are affected influences where the problem could be located on the
h/w side of things. Review the system's hardware, switch error counters, etc. to see if there
is any indication of where the issue might lie. The most likely candidate is the hba itself.

Recommendations:
1. engage storage vendor support
2. check switch error counters
3. monitor host side HBA error counters
4. turn on extended driver logging, if available; most drivers like lpfc and qla2xxx that set
   DID_ERROR have extended logging in and around most (but not all!) places within the driver
   that set DID_ERROR.

The driver is reporting that it is receiving odd/unexpected/invalid information from the hba.
This generally indicates an issue within the SAN (external to the OS). Review the system's
hardware, switch error counters, etc. to see if there is any indication of where the issue
might lie. The most likely candidate is the hba itself. Were you able to replace the HBA?
Could also be a bad GBIC or cable.

If desired, driver extended logging can be enabled. It won't log additional information in all
cases, but in most cases it will. The data provided may not provide additional insight into the
problem, but if you wish to enable it anyway:

+ To turn on extended driver logging for the qla2xxx driver, see

"[Troubleshooting] How do I turn on additional qla2xxx or qla4xxx driver extended logging and what logging is available?"

+ To turn on extended driver logging for the lpfc driver, use flag value 64 (LOG_FCP) and see

"[Troubleshooting] How do I turn on additional lpfc driver extended logging and what logging is available?"

NOTE: this will generate a LOT of logging, with most of it being normal activity that is completely unrelated to any problems, and the information logged isn't easy to decode in a straightforward way. As a result, the recommendation is to not enable this logging unless absolutely necessary or recommended within a support case.
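
If it is enabled, a minimal sketch for one lpfc host (host1 is only an example; pick the host reporting the errors):

Raw

echo 64 > /sys/class/scsi_host/host1/lpfc_log_verbose   # 64 = 0x40, LOG_FCP
# ... reproduce or wait for the DID_ERROR events, then turn extended logging back off ...
echo 0 > /sys/class/scsi_host/host1/lpfc_log_verbose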

 


00.07.00.28 SAM_STAT_TASK_SET_FULL + DID_ERROR - queue full condition


 

Raw


0x00.07.00.28
           28   status byte : SAM_STAT_TASK_SET_FULL - essentially a target queue full condition
        00         msg byte : <{likely} not valid, see other fields>
     07           host byte : DID_ERROR - internal error
  00            driver byte : <{likely} not valid, see other fields>

===NOTES===================================================================================

    o 00.07.00.28 SAM_STAT_TASK_SET_FULL (storage device queue full condition) + 
          +------ DID_ERROR

Example:

kernel: sd 7:0:3:42: Unhandled error code
kernel: sd 7:0:3:42: SCSI error: return code = 0x00000028
kernel: Result: hostbyte=HOST_OK driverbyte=DRIVER_OK,SUGGEST_OK

Summary:
The key issue is the queue full condition.  The DID_ERROR is likely secondary to the queue
full.  The queue full condition is being returned by storage, indicating that it is not
able to handle any additional io requests at that time.  If the queue fulls are only 
happening occasionally, then they can be safely ignored.  The kernel will retry the io.
However, if they happen frequently, or frequently enough that they are having an impact
on the system, then they need to be addressed. 

Description:
DID_ERRORs are assigned by the driver upon detection of conflicting status/information
returned from storage (for example a successful io completion but a non-zero residual byte 
count, that is, the read or write completed "successfully" but didn't read or write 
all the data -- to which the driver goes "huh?!" and sets DID_ERROR).  

The queue full condition means too many io commands have arrived at the storage device 
exceeding the queue limit within storage.  IO could end up being dropped resulting in 
timeouts, and in general its a red flag from storage that the connected host or hosts 
are overdriving it.

The lun queue_depth (see /sys/block/sd*/device/queue_depth) is on a per lun basis.  If 
there are a lot of luns exported from storage to this host, or the set of luns exported to 
the set of hosts that share storage is a large number, then this can lead to queue full 
conditions when all luns become highly active.  Essentially the lun queue_depth is set 
too high for the number of luns exported from storage to the host vs the activity level 
of the system.  

Shared storage servicing multiple hosts can increase the likelihood of this type of 
error status.

Retries for io returning queue full (or busy), is a delayed retry to try and let 
storage recover.

More/Next Steps:
Examine the current value of queue_depth within /sys/block/sd*/device/queue_depth (a survey 
sketch follows).  Determine the number of luns from storage to this host (if non-shared 
storage) or the total number of luns exported to all hosts.  Reduce the queue_depth.
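
A minimal sketch of that survey (device names will vary):

Raw

# per-lun queue depth for every scsi disk, plus a count of how many luns this host sees
grep -H . /sys/block/sd*/device/queue_depth
ls -d /sys/block/sd* | wc -l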

Recommendations:

1. Reduce the lun queue depth to avoid storage getting into a queue full condition.
2. Examine and possibly reduce the lun queue depth on other hosts sharing the
   same storage depending on load.

For the DID_ERROR:

NOTE: DID_ERROR can be set due to storage returning a queue full condition but not 
setting the rest of the response frame information correctly.  This can result
in the driver detecting a mismatch in response frame data, resulting in a DID_ERROR.

If the DID_ERROR needs further examination, either before or after addressing the
queue full condition within storage, then you can try the following steps.

1. Turn on extended driver logging to try and ascertain which specific DID_ERROR is 
   being triggered.  {See additional information, DID_ERROR may be a consequence of 
   queue full plus storage not fully setting up the response frame in fibre channel 
   environments.}
2. review messages file after queue full/did error is logged.

Also check the driver for updates within processing of storage queue full handling and 
firmware detected underrun conditions.  If the qla2xxx driver, see bugzilla 805280.  
Other FC drivers might need the same storage workaround implementation.

NOTE: even with the driver workaround to prevent DID_ERRORs and immediate retries, 
without lowering lun queue_depth the system can still encounter QUEUE FULL conditions.  
Storage may not be able to recover once in a queue full condition depending on the 
number of hosts sharing the storage.  The only sure cure is to lower lun queue depth 
to avoid over driving storage in the first place.
 

 


00.0D.00.00 DID_REQUEUE


 

Raw


0x00.0D.00.00
           00   status byte : <{likely} not valid, see other fields>
        00         msg byte : <{likely} not valid, see other fields>
     0D           host byte : DID_REQUEUE -  Requeue command (no immediate retry) also w.o
                                             decrementing the retry count {RHEL5/RHEL6 only}
  00            driver byte : <{likely} not valid, see other fields>


===NOTES===================================================================================

          o 00.0D.00.00 DID_REQUEUE
                + From the original upstream patch that added DID_REQUEUE:
                  "We have a DID_IMM_RETRY to require a retry at once, but we could do with
                   a DID_REQUEUE to instruct the mid-layer to treat this command in the
                   same manner as QUEUE_FULL or BUSY (i.e. halt the submission until
                   another command returns ... or the queue pressure builds if there are no
                   outstanding commands)."
                + So, REQUEUE is just essentially a delayed retry... rather than immediately
                  resubmitting the io, the io is requeued onto the request queue and has to
                  drain down to the driver and out to storage only after some current
                  outstanding io completes.

---------------------------------------------------------------------------------------------------

Aug  1 08:10:02 hostname kernel: sd 1:0:0:5: SCSI error: return code = 0x000d0000

Advise to apply the following errata, fixed in 5.6 and later.
http://rhn.redhat.com/errata/RHSA-2011-0017.html

This is a known issue and further described in the following BZ when using lpfc driver.
Bug 627836 - retry rather than fastfail DID_REQUEUE scsi errors with dm-multipath

The kbase article which explains the issue:

0x000d0000 (DID_REQUEUE) SCSI error with Emulex/LPFC driver on RHEL 5

Bug 516303 [Emulex 5.7 bug] lpfc: setting of DID_REQUEUE conditions
Bug 627836 retry rather than fastfail DID_REQUEUE scsi errors with dm-multipath

 


00.0F.00.00 DID_TRANSPORT_FAILFAST


 

Raw


0x00 0F 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     0F           host byte : DID_TRANSPORT_FAILFAST - transport class fastfailed the io
  00            driver byte : {likely} not valid, see other fields;

=== NOTES ===================================================================================

        + There were transport issues resulting in an inability to communicate or
          send io commands to the target.  For example, within an FC/SAN environment,
          the remote port the io is to be sent to is in the BLOCKED rather than ONLINE
          state.

        + recommend check for link down or high error rates on fibre/parallel scsi.
          network links (iscsi) as these are common causes for this error being
          returned.  That is, this error is hardware status based.

{Transport failfast will only occur if the option is turned on; normally transport issues 
are DID_TRANSPORT_DISRUPTED or similar -- and as such are retryable io... but here the 
retries are suppressed in the interest of getting the io back up the io stack, usually 
because there is another path to try or lvm mirroring is in use and there is another mirror 
that can be rapidly tried. The failfast option is enabled by users, it is not on by default.}
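
One way to check whether fast fail is actually armed on the FC remote ports (a sketch; the attribute reads "off" when the fast-fail timer is not set):

Raw

grep -H . /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo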

Example:
Dec  7 16:08:38 hostname kernel: sd 1:0:1:0: Unhandled error code
Dec  7 16:08:38 hostname kernel: sd 1:0:1:0: SCSI error: return code = 0x000f0000
Dec  7 16:08:38 hostname kernel: Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK,SUGGEST_OK
Dec  7 16:08:38 hostname kernel: device-mapper: multipath: Failing path 8:96.
Dec  7 16:08:38 hostname kernel: sd 1:0:1:0: Unhandled error code
Dec  7 16:08:38 hostname kernel: sd 1:0:1:0: SCSI error: return code = 0x000f0000
Dec  7 16:08:38 hostname kernel: Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK,SUGGEST_OK


Summary:
The FAILFAST option is enabled, so if/when an error is encountered the io is immediately returned
up the io stack rather than being retried (as it normally would be).

Description:

More/Next Steps:
Gather extended logging information from driver.
Does the problem follow the HBA if it is swapped with another one?
Turn on driver debug/verbose logging to see if more specific information is available.


Recommendations:
Engage storage vendor support, likely storage side issue present.

----------------------------------------------------------------------------------------------
FAILFAST is set within the transport handlers within the kernel as seen here:
----------------------------------------------------------------------------------------------

C symbol: DID_TRANSPORT_FAILFAST

  File                    Function                Line
0 libiscsi2.c             <global>                1453 sc->result = DID_TRANSPORT_FAILFAST << 16;
1 scsi_transport_iscsi2.c iscsi2_session_chkready  351 err = DID_TRANSPORT_FAILFAST << 16;
2 scsi_transport_fc.h     fc_remote_port_chkready  689 result = DID_TRANSPORT_FAILFAST << 16;

For example within the fc transport code:

/**
 * fc_remote_port_chkready - called to validate the remote port state
 *   prior to initiating io to the port.
 *
 * Returns a scsi result code that can be returned by the LLDD.
 *
 * @rport:      remote port to be checked
 **/
static inline int
fc_remote_port_chkready(struct fc_rport *rport)
{
        int result;

        switch (rport->port_state) {
:  
        case FC_PORTSTATE_BLOCKED:
                if (rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)
                        result = DID_TRANSPORT_FAILFAST << 16;
                else
                        result = DID_IMM_RETRY << 16;
                break;

So, if the portstate is blocked and FAILFAST is off, the code does immediate retries but
if the option is enabled, then a FAILFAST status is returned and the io is not sent.


 


00.11.00.18 RESERVATION CONFLICT + DID_NEXUS_FAILURE


 

Raw


0x00.11.00.18
           18   status byte : SAM_STAT_RESERVATION_CONFLICT - device reserved to another HBA,
                                                              command failed
        00         msg byte : <{likely} not valid, see other fields>
     11           host byte : {RHEL5/6} DID_NEXUS_FAILURE - permanent nexus failure, retry on
                                                            other paths may yield different
                                                            results
  00            driver byte : <{likely} not valid, see other fields>


===NOTES===================================================================================


    o 00.11.00.00 DID_NEXUS_FAILURE             - added into scsi kernel code in 5.8
    o 00.00.00.18 SAM_STAT_RESERVATION_CONFLICT - device reserved on another I_T nexus

      + the primary issue is the reservation conflict
      + a reservation conflict will not get better with retries, so the 
        DID_NEXUS_FAILURE is added to the io return code as a result of the
        primary problem: the reservation conflict
      + DID_NEXUS_FAILURE will prevent retrying this command on the current
        path.  Since reservations are typically I_T nexus specific, a different
        path with different I_T (initiator, aka hba/target aka storage port combo) 
        may succeed.  However, the command stands no chance of completing on the
        current path due to the reservation conflict.
      + recommend having customer check configuration to determine what has an
        outstanding reservation on the device.  For example, a failure within
        3rd party tape backup may have left an errant reservation on a tape drive
        (reservation conflicts often most associated with tape devices)

--------------------------------------------------------------------------------------------

Example:

st2: Error 110018 (sugg. bt 0x0, driver bt 0x0, host bt 0x11).

Summary:

Red Hat Enterprise Linux does not use scsi reservations itself, so they are being applied
by an application, or if running under a hypervisor such as VMware, then possibly by the
hypervisor and/or cluster management software.

A reservation conflict scsi status has been returned by the storage device.  The kernel
has also marked the io with DID_NEXUS_FAILURE to prevent retrying the io on the current
path.

Description:

More/Next Steps:
+ customer needs to check configuration and determine why reservation conflict exists
  and how to correct it. Reservations are typically associated with 3rd party applications.
+ can run sg_tur to see what the status is, often the reservation conflict will show up
  as a return status.
  -  ls -1c /dev/sd*[!0-9] | sort | xargs -I {}  sg_turs -vv  {}    << disks only
  -  ls -1c /dev/sg*       | sort | xargs -I {}  sg_turs -vv  {}    << all scsi devices
+ can attempt to see who has the reservation using sg3_utils command sg_persist:
  -  ls -1c /dev/sd*[!0-9] | sort | xargs -I {}  sg_persist --in -vv -k -d  {}  << disks only
  -  ls -1c /dev/sg*       | sort | xargs -I {}  sg_persist --in -vv -k -d  {}  << all scsi devices

Recommendations:
+ customer side issue, needs to review system, applications, and storage to ascertain
  why reservations are present.



--------------------------------------------------------------------------------------------

DID_NEXUS_FAILURE was added in RHEL5.8 and is present there and in later releases.

In 6.4, DID_NEXUS_FAILURE only shows up in a couple of places.

  File          Function                    Line
0 scsi.h                             433 #define DID_NEXUS_FAILURE 0x11
1 scsi_error.c  scsi_decide_disposition     1585 set_host_byte(scmd, DID_NEXUS_FAILURE);
2 scsi_lib.c    __scsi_error_from_host_byte  687 case DID_NEXUS_FAILURE:
3 virtio_scsi.c virtscsi_complete_cmd        148 set_host_byte(sc, DID_NEXUS_FAILURE);

The primary code for the case of reservation conflict/nexus failure is here:

1 scsi_error.c  scsi_decide_disposition     1585 set_host_byte(scmd, DID_NEXUS_FAILURE);
- - - - - - - - - - - - - - - - - - - - - - 
        case RESERVATION_CONFLICT:
                sdev_printk(KERN_INFO, scmd->device,
                            "reservation conflict\n");
                set_host_byte(scmd, DID_NEXUS_FAILURE);
                return SUCCESS; /* causes immediate i/o error */

In other words, if we get a reservation conflict, add DID_NEXUS_FAILURE before returning 
the io up the io stack.  The nexus failure is to cause an immediate io error and prevent
retries on the current device (path) as these will fail until existing reservation removed.


 


06.00.00.00 DRIVER_TIMEOUT


 

Raw


0x06 00 00 00
           00   status byte : {likely} not valid, see other fields;
        00         msg byte : {likely} not valid, see other fields;
     00           host byte : {likely} not valid, see other fields;
 06            driver byte : DRIVER_TIMEOUT

=== NOTES ===================================================================================

        + Commands timing out.  Unless these are continuous, likelihood is that retries
          succeeded.  If that is the case (sporadic timeouts logged), then not major
          issue.

        + Storage didn't complete the io within the currently set timeout period.

        + If timeouts are followed by lun, target, bus, or adapter resets then storage
          issue is fairly serious and storage hardware vendor should be engaged.  Essentially
          these steps are taken when communication to storage is failing for some reason.

        + Different likely causes depending on how many and how often timeouts are being
          logged.  For example, groups or bursts of these across multiple devices 
          indicates a likely storage controller periodic overload -- questions to 
          answer: is storage shared across multiple hosts or just used by this one system 
          (this is not lun sharing, but sharing of the storage nports themselves... something 
          many sysadmins may not know the answer to without engaging their san storage folks).
          Collect iostat/vmstat and maybe blktrace data and review/compare/contrast 
          data before these are logged to during being logged.  Look at driver queue 
          depth -- is storage likely being overdriven?  Engage storage vendor to 
          look at in-controller statistics during reported timeout periods.  The key 
          here is that the timeouts are sporadic and come in bursts... if the bursts 
          are within hours to days or weeks between that implies a controller overdriven issue 
          to be investigated.  Reducing the lun queue depth within the driver can be 
          one way of determining if overdriven storage queues is responsible... but 
          this can impact both latency and throughput which is why a good baseline of 
          iostat/vmstat/blktrace data is important.  Also, grab top -- is there an 
          application that is always being run when the problem occurs vs not occurring? 

        + Its possible the device no longer exists on the SAN but for some reason an 
          RSCN or other notification isn't happening.  In this case the timeouts will 
          be constant against a given device or set of devices.  The timeout will occur 
          anytime the device is accessed.

        + Its possible that virtual nport mapping is in play and the switch has lost 
          the mapping.  For example, Cisco switches have the capability to map hba 
          nport 'ABC' to 'XYZ' so that the storage controller always sees commands 
          coming from 'XYZ' and not 'ABC'.  This allows replacement of an hba, 
          with a new nport identifier say of 'DEF'.  With virtual port 
          mapping you only need to update the map in the switch and not have to go 
          to all the storage controller ports and update them to accept commands from 
          'DEF'... the switch just remaps 'DEF' to 'XYZ' - done.  However, there have 
          been cases where the map is lost during maintenance cycles, map merges being 
          the most common place.  The result is the commands are sent, but without the 
          port mapping, the data/status/results are dropped by the switch resulting 
          in constant timeouts.


Example:
kernel: SCSI error : <0 0 0 4> return code = 0x6000000       

Summary:
Command has timed out.

Description:
The IO command was sent to the target, but neither a successful nor unsuccessful status 
has been returned within the expected timeout period.

More/Next Steps:
This is an error being returned back from the driver, but it's not the driver's fault. In
99% of the cases this is a storage related issue. The command was sent, but no response 
was received back from storage within the allotted time (io timeout period). 

The default device timeout period is controlled by udev rules
50-udev.rules::
"
# sd:           0 TYPE_DISK, 7 TYPE_MOD, 14 TYPE_RBC
# sr:           4 TYPE_WORM, 5 TYPE_ROM
# st/osst:      1 TYPE_TAPE
# sg:           8 changer, [36] scanner
ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
        RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="1", \
        RUN+="/bin/sh -c 'echo 900 > /sys$$DEVPATH/timeout'"
"

...so 60s for disks, cdroms, dvd, and 900s for tapes.  For any other devices the default
is 30s (as applied within the kernel at initial device setup).  Other udev rules or changes
to the above can alter the default timeout value.
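
To confirm what timeout each disk actually ended up with (a quick check; the value is in seconds):

Raw

grep -H . /sys/block/sd*/device/timeout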

Command timeouts are not necessarily bad, especially if they only happen occasionally. 
They become a problem when they occur more often or systemically, in that they can impact users.

Different likely causes are dependent upon how many and how often this status is returned. 
For example, groups or bursts of these across multiple devices indicates a likely storage 
controller periodic overload or other storage controller issue (recovery, configuration, 
loads from other attached systems) or possibly temporary san/switch congestion. Some 
questions to answer include:

        * is the storage shared with other systems or just used by this system (this is not 
          lun sharing, but sharing of the storage nports themselves... something many customers 
          won't necessarily know the answer to without engaging their san storage support
          folks).
        * has anything been changed within the system or storage lately?
        * how long has the problem been going on?
        * what is its frequency?
        * what effect does this have on your system and its users? 


Attempting a dd command to the sd device(s) in question is one way to ascertain if the device 
is dead-dead (aka possibly deleted in storage or virtual port mapping issues) or just mostly 
or occasionally dead (congestion).  If dead-dead, then no io commands will complete.  

Rescan storage or use sg_luns to determine if luns are still present/available.
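
A minimal sketch of those checks (the device and host numbers are examples only):

Raw

# single direct read to see whether the device responds at all
dd if=/dev/sdX of=/dev/null bs=4k count=1 iflag=direct

# rescan one scsi host, then ask the target which luns it currently reports
echo "- - -" > /sys/class/scsi_host/host0/scan
sg_luns -v /dev/sg0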


Recommendations:
1. check the io timeout value, is it set too short?
2. is lun queue depth set too high for the storage configuration? reduce for testing.
3. have the customer engage their storage vendor,
4. gather baseline data from when the problem is and is not occurring, baseline loading
   information may need to be gathered (iostat, vmstat, top)
5. turn on additional kernel messages such as extended scsi and/or driver logging, and/or
6. use a test load script and gather data while trying to induce and reproduce the issue
7. check hba port statistics in sysfs and have storage admin check similar counters within
   switch.


1. io timeout value
Check current timeout values, and if it seems appropriate, increase the timeout value
to allow storage enough time.  Typical io timeout values are 30 or 60 seconds.  If
increasing the timeout value beyond 120 seconds, make sure the task stall logic timer
is also reset (its default is 120 and if io is allowed to timeout at, say, 150s, then
false positives can be generated.  Nominally setting task stall detect at 2x io timeout
is a good place to start.)  Timeout values in excess of 120 seconds are sometimes used
within virtual machine environments to compensate for increased latency on shared 
platform hardware.

For example, if the current io timeout is set to 20s, then
setting it 60s may provide relief of temporary storage congestion issues:

echo 60  > /sys/block/sdX/device/timeout
echo 120 > /proc/sys/kernel/hung_task_timeout_secs

2. lun queue depth
The default lun queue depth, typically 30-32 for fibre channel storage adapters, can be
too high for the storage configuration if there are a lot of luns configured and/or the
storage is shared with other hosts.

Some lun queue depths can be set on-line via /sys.  Note in the following example the
queue_depth has read/write access allowing it to be set without reloading the driver.

# ls -l /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth
-rw-r--r--. 1 root root 4096 Aug 13 09:55 /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth
# cat /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth
32
# echo 16 > /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth
# cat /sys/devices/pci0000:00/0000:00:05.0/0000:1f:00.0/host0/rport-0:0-2/target0:0:2/0:0:2:0/queue_depth
16

However, the preferred method is to set the default within the driver via config options files
so the value is picked up at boot time. See 

"What is the HBA queue depth, how to check the current queue depth value and how to change the value?"

for instructions on changing lun queue depth on lpfc and qla2xxx drivers.

3. vendor support
Ultimately the storage hardware vendor may need to be engaged to ascertain the root cause of
io timeouts if they continue or are frequent.  If a hardware vendor ticket is opened, post the
ticket number from the vendor in any Red Hat case to allow us to engage with the vendor if
need be.

4. baseline
If the problem is not continuous, the baseline data of iostat/vmstat/top will be needed if
further analysis is desired or possible.  See

"[Troubleshooting] Gathering system baseline resource usage for IO performance issues"

for a script that can be used to gather system resource information for review.  Data should be
gathered and submitted both from time periods when the problem is not occurring and when it is.
Typically about 1 hour of data each is a reasonable amount to review.

5. turn on extended logging
This step can be used if the storage vendor doesn't find anything and is looking for additional
information/cooperation on the issue.  Typically extended logging for timeout issues won't
reveal much more in terms of information other than that the io timed out within storage.  See
one of the following for appropriate information/instructions:

6. test load
Induce a high io load similar to the load observed while the problem is happening, based on the
data collected in "4. baseline" above.  Collect iostat/vmstat/top data for review.  This is
pretty much a last-step attempt at inducing the problem to allow studying the issue and what
triggers it, plus whether any of the steps above, like reducing lun queue depth, mitigate the
issue.  Most times it can be difficult to actually induce timeouts within storage via a simple
load increase.

7. check port statistics
The hba port statistics, if supported by the driver/hba, are available in
/sys/class/fc_host/host*/statistics/*.  See

"What do the fc statistics under /sys/class/fc_host/hostN/statistics mean"

for details.
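
For a quick look at these counters (a minimal sketch; hostN and the exact set of statistics
files vary by driver and HBA -- counters such as link_failure_count, loss_of_signal_count,
loss_of_sync_count, invalid_crc_count and error_frames are the usual ones of interest):

# dump every fc_host statistics counter with its file name
grep -H . /sys/class/fc_host/host*/statistics/* 2>/dev/null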

 


06.00.00.08 DRIVER_TIMEOUT + {SCSI} BUSY


 

Raw


0x06 00 00 08
           08   status byte : SAM_STAT_BUSY - device {returned} busy {status};
        00         msg byte : {likely} not valid, see other fields;
     00           host byte : {likely} not valid, see other fields; 
  06            driver byte : DRIVER_TIMEOUT
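
As a quick aid when reading these breakdowns, the return code can be split into its four bytes
directly in a bash shell.  A minimal sketch using the value above:

# split a logged return code into its driver/host/msg/status bytes
rc=0x06000008
printf 'driver=%02x host=%02x msg=%02x status=%02x\n' \
    $(( rc >> 24 & 0xff )) $(( rc >> 16 & 0xff )) $(( rc >> 8 & 0xff )) $(( rc & 0xff ))
# prints: driver=06 host=00 msg=00 status=08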

=== NOTES ===================================================================================


    o 06.00.00.00 DRIVER_TIMEOUT
    o 00.00.00.08 SAM_STAT_BUSY - device {returned} busy {status}

      + commands are timing out due to scsi busy status being returned.
      + typically returned by storage target (scsi busy status), but in some
        cases it is "manufactured" by the driver because of hba-based issues.
        o for example, the mptscsih driver returns this both if the device returns
          this status OR if the hba returns busy or insufficient resources (too busy
          with other work).  Either of these cases is very likely a hardware-based
          issue: either in the storage device itself, an hba issue induced by the
          storage device(s) or transport, or in the hba itself.
      + recommend checking hardware
      + recommend reviewing driver specifics as to if/where SAM_STAT_BUSY is assigned
        as status (typically not done, but can be driver specific).

--------------------------------------------------------------------------------------------

Example:
kernel: sd 0:0:1:0: timing out command, waited 360s
kernel: sd 0:0:1:0: SCSI error: return code = 0x06000008
kernel: end_request: I/O error, dev sdb, sector 10564536
kernel: Aborting journal on device dm-4.
kernel: ext3_abort called.
kernel: EXT3-fs error (device dm-4): ext3_journal_start_sb: Detected aborted journal
kernel: Remounting filesystem read-only

Summary:
The SAM_STAT_BUSY is the scsi status code back from the target (disk, device).  Typically
a device busy should be a transitory issue that corrects itself in time.  If it doesn't, 
then essentially this is a device timeout condition -- the device failed to respond within
a reasonable amount of time over some number of retries.

Description:
From the SCSI spec:

Hex     Description
08      BUSY               Indicates the target is busy. Returned whenever a target is 
                           unable to accept a command from an otherwise acceptable 
                           initiator.

So, in this case we're getting back constant device busy status from storage and cannot 
get the command completed resulting in eventually giving up as a timeout condition.

More/Next Steps:
The above is very common; a filesystem journal write couldn't complete, so the filesystem
has no choice but to remount the filesystem read-only to protect filesystem integrity.  The
timeout in this case was 360s, or 6 minutes!  This is no short-term congestion issue but
some type of non-responsive hardware issue -- check the hardware, which is a common
theme for any timeout issue.

Recommendations:
Engage storage h/w support to determine cause of device busy/io timeouts.

If the problem is logged rarely and the system continues, then these can mostly be
ignored.  The issue is due to a temporary congestion issue which clears quickly.

If they are frequent to the point that io fails due to constant device busy status, or
the busy status is being logged frequently, then engage your storage hardware support group
to determine root cause of storage hardware returning device busy status.  If the issue
is due to storage load, reducing the lun queue depth may provide some relief by lowering
the overall io load placed on storage by the host.  

 


06.0D.00.00 DRIVER_TIMEOUT + DID_REQUEUE


 

Raw


0x06.0D.00.00
           00   status byte : <{likely} not valid, see other fields>
        00         msg byte : <{likely} not valid, see other fields>
     0D           host byte : DID_REQUEUE - requeue command (no immediate retry) without
                                            decrementing the retry count {RHEL5/RHEL6 only}
  06            driver byte : DRIVER_TIMEOUT


===NOTES===================================================================================


    o 06.0D.00.00 DRIVER_TIMEOUT + REQUEUE
        + Commands are timing out and are being requeued for retry.  REQUEUE is slightly
          different than an immediate retry in that the requeue means the io
          can be delayed before it is retried.  Also, REQUEUE does not decrement the
          retry count (so this doesn't get counted against the total maximum retries).
        + typically transient in nature; if it is not happening a lot or constantly, then
          io is succeeding after being requeued.
        + see 

0x06000000

and/or

0x000D0000

for more information on timeouts and requeue, respectively.

--------------------------------------------------------------------------------------------

0x060d0000 means driver timeout with requeue.  If there are no subsequent errors for the same
device and they're running RHEL5.6 or later, then the i/o will have succeeded with a retry.

Example:
kernel: sd 3:0:0:15: timing out command, waited 30s
kernel: sd 3:0:0:15: SCSI error: return code = 0x060d0000                << return code
kernel: Result: hostbyte=DID_REQUEUE driverbyte=DRIVER_TIMEOUT,SUGGEST_OK

Summary:
Command has timed out and is being requeued rather than immediately retried.

Description:
The IO command was sent to the target, but neither a successful nor unsuccessful status has
been returned within the expected timeout period.  If no subsequent events for the same device
are logged, then the io has succeeded upon being retried.

More/Next Steps:
This is an error being returned back from the driver, but it is not the driver's fault.  In
99% of cases this is a storage related issue.  The command was sent, but no response was
received within the allotted time.  Additional steps that could be taken would be to

1. check the io timeout value, is it set too short?
2. have the customer engage their storage vendor,
3. gather baseline data from when the problem is and is not occurring,
4. turn on additional kernel messages such as extended scsi and/or driver logging, and/or
5. baseline loading information may need to be gathered (iostat, vmstat, top)

Recommendations:
Check current timeout values, and if it seems appropriate, increase the timeout value to allow
storage enough time.  Typical io timeout values are 30 or 60 seconds.  If increasing the
timeout value beyond 120 seconds, make sure the task stall logic timer is also reset (its
default is 120 and if io is allowed to time out at, say, 150s, then false positives can be
generated.  Nominally setting task stall detect at 2x io timeout is a good place to start.)

For example, if the current io timeout is set to 20s, then setting it to 60s may provide
relief from temporary storage congestion issues:

echo 60  > /sys/block/sdX/device/timeout
echo 120 > /proc/sys/kernel/hung_task_timeout_secs

Request step 2 and get the vendor case number posted within the ticket.  We'll need this when
engaging the vendor from our side of things -- if it comes to that.

Request step 3 if the problem is not continuous; the baseline data of iostat/vmstat/top will
be needed to make further analysis possible.

Request step 4 only if the storage vendor doesn't find anything and is looking for additional
information/cooperation on the issue.

Offer step 5 as a means of collecting baseline data against a known load condition.

 


08.00.00.02 DRIVER_SENSE + {SCSI} CHECK_CONDITION


 

Raw


0x08 00 00 02
           02   status byte : SAM_STAT_CHECK_CONDITION - check returned sense data, esp. 
                                                         ASC/ASCQ
        00         msg byte : {likely} not valid, see other fields;
     00           host byte : {likely} not valid, see other fields;
  08            driver byte : DRIVER_SENSE {scsi sense buffer available from target}

===NOTES===================================================================================


        + status indicates command was returned by the target (disk in this case),
          with a scsi CC (Check Condition) status byte.  The driver byte indicates
          that a sense buffer is available for this command and should be consulted
          for sense key as well as asc/ascq information from the target that will 
          provide more information on the issue of why the io wasn't completed 
          successfully.  Often this sense buffer information is already decoded and
          output within the messages file.  For example an 04/02 asc/ascq combination
          means "Not Ready, manual intervention required" and can show up within
          the messages this way (already interpreted).  The sense key may also
          be interpreted and displayed, as in:
              "kernel: sdas: Current: sense key: Aborted Command"
          The important thing to note is that this information is coming from the 
          storage target vs from the kernel or its driver.
 
        + sense buffer includes three key pieces of information:
            - sense key
            - additional sense code (ASC)
            - additional sense code qualifier (ASCQ)
          The codes within these three fields are defined by the scsi standard, although
          some value ranges are defined for use by vendors and vendor unique/specific
          codes.  This can make interpretation of the data difficult.

        + the '02' is scsi status returned from the target device, within the sense 
          buffer was additional information that should indicate what condition is 
          being reported. 
                # sense key: Aborted Command, if this is reported then the sense 
                             key within the sense buffer is Bh
                # sense key: Unit Attention, if this is reported then the sense key 
                             within the sense buffer is 6h and there will be additional 
                             information within the sense buffer asc/ascq fields that 
                              will be reported in the messages log file.
                # See https://access.redhat.com/kb/docs/DOC-64554 for more information
                on sense keys and sense buffers in general.
                # If the sense key was aborted command, then this means that the 
                target aborted the command based upon a request from the initiator. 
                This is not necessarily an error condition that you need to be concerned 
                about -- but you do need to ascertain why the kernel was requesting 
                the command to be aborted.  Common reasons include

                * temporary loss of transport, for example link down/link up -- 
                     upon returning all outstanding commands are aborted because the
                     kernel doesn't know if it missed a response while the transport was down


Example:
Jun 29 19:07:55 hostname kernel: sd 0:0:0:106: SCSI error: return code = 0x08000002
Jun 29 19:07:55 hostname kernel: sdf: Current: sense key: Aborted Command             < sense key = Bh
Jun 29 19:07:55 hostname kernel:     Add. Sense: Internal target failure
Jun 29 19:07:56 hostname kernel: end_request: I/O error, dev sdf, sector 41143440

May 30 17:16:30 hostname kernel: sd 1:0:0:14: SCSI error: return code = 0x08000002
May 30 17:16:30 hostname kernel: sdp: Current: sense key: Aborted Command       < sense key = Bh
May 30 17:16:30 hostname kernel:     <<vendor>> ASC=0xc0 ASCQ=0x0ASC=0xc0 ASCQ=0x0
May 30 17:16:30 hostname kernel: end_request: I/O error, dev sdp, sector 13250783


Dec 21 06:37:00 hostname kernel: sd 2:0:0:287: SCSI error: return code = 0x08000002
Dec 21 06:37:00 hostname kernel: sdfq: Current: sense key: Hardware Error             < sense key = 4h
Dec 21 06:37:00 hostname kernel:     Add. Sense: Internal target failure
Dec 21 06:37:00 hostname kernel:
Dec 21 06:37:00 hostname kernel: end_request: I/O error, dev sdfq, sector 81189199

Mar 21 16:41:14 hostname kernel: sd 4:0:2:29: SCSI error: return code = 0x08000002
Mar 21 16:41:14 hostname kernel: sdkv: Current: sense key: Not Ready              < sense key = 2h
Mar 21 16:41:14 hostname kernel:     Add. Sense: Logical unit not ready, manual intervention required

Jun 11 17:11:26 hostname kernel: sd 3:0:1:0: SCSI error: return code = 0x08000002
Jun 11 17:11:26 hostname kernel: sdd: Current: sense key: Illegal Request          < sense key = 5h
Jun 11 17:11:26 hostname kernel:     <<vendor>> ASC=0x94 ASCQ=0x1ASC=0x94 ASCQ=0x1
Jun 11 17:11:26 hostname kernel: Buffer I/O error on device sdd, logical block 0


There are 16 different scsi sense key values, of which 7 are unlikely or reserved.  The 9
codes you might encounter are:

Sense Key
0h           NO SENSE.  
1h           RECOVERED ERROR.
2h           NOT READY.
3h           MEDIUM ERROR.
4h           HARDWARE ERROR.
5h           ILLEGAL REQUEST.
6h           UNIT ATTENTION.
7h           DATA PROTECT.
Bh           ABORTED COMMAND.

In each of the above cases there is a different problem present.  For example in the hardware
error case, the disk had failed.  In the not ready case, the io was being sent to a passive
path.

The ASC/ASCQ codes are decoded when they are scsi standard defined, otherwise if they are vendor
specific codes then they are output with "vendor" lines as shown above.
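
If a recent enough sg3_utils is installed, the sg_decode_sense utility can be used to decode
raw sense bytes by hand.  A minimal sketch, assuming fixed-format sense data corresponding to
the first example above (sense key Bh / Aborted Command, asc/ascq 44h/00h Internal target
failure); the hex bytes here are illustrative only:

# decode a fixed-format sense buffer given as hex bytes on the command line
sg_decode_sense 70 00 0b 00 00 00 00 0a 00 00 00 00 44 00 00 00 00 00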


Summary:
Device has returned back a CHECK CONDITION (CC) SCSI status and a sense buffer.

Description:
The scsi command status returned from the target is CHECK CONDITION, so this is truly a
target/device reported problem and not a transport issue.  The IO command has encountered some
type of issue within the target device resulting in the command completing, but not
successfully.  The driver byte (0x08) indicates that a sense buffer from the target device *IS*
available.  The sense buffer will have a sense key, additional sense code (ASC) and additional
sense code qualifier (ASCQ) that more fully explain the issue.

More/Next Steps:
This is an error being returned back from the target and usually indicates a target failure 
of some type. Please refer to the sense key and ASC/ASCQ lines within messages for more 
details. For example, if this is an aborted command that was requested because of a timeout, 
then the root cause is the timeout which should be investigated further. If the abort was 
unsolicited by the host, then storage should be engaged to review and address the issue.

Recommendations:
1. have the customer engage their storage vendor,
2. turn on additional kernel messages such as extended scsi or driver logging, and/or
3. baseline loading information may need to be gathered (iostat, vmstat, top) 

 


08.07.00.02 DRIVER_SENSE + DID_ERROR + {SCSI} CHECK_CONDITION


 

Raw


0x08.07.00.02
           02   status byte : SAM_STAT_CHECK_CONDITION - check returned sense data, esp. 
                                                         ASC/ASCQ
        00         msg byte : <{likely} not valid, see other fields>
     07           host byte : DID_ERROR - internal error
  08            driver byte : DRIVER_SENSE {scsi sense buffer available from target}


===NOTES===================================================================================

    o 08.07.00.02 DRIVER_SENSE + DID_ERROR + {SCSI} CHECK_CONDITION 

--------------------------------------------------------------------------------------------

Example:

kernel: sd 1:0:0:2: SCSI error: return code = 0x08070002
kernel: sdab: Current: sense key: Medium Error
kernel:     Add. Sense: Unrecovered read error
kernel:
kernel: end_request: I/O error, dev sdab, sector 83235271
kernel: device-mapper: multipath: Failing path 65:176.
multipathd: dm-8: add map (uevent)
multipathd: dm-8: devmap already registered

...and another case...

kernel: sd 3:0:8:15: Unhandled sense code
kernel: sd 3:0:8:15: SCSI error: return code = 0x08070002
kernel: Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sdfu: Current: sense key: Aborted Command
kernel:     Add. Sense: Data phase error

Summary:
IO failing with medium error, or other target type failure against device.

Description:
The IO command is being failed with a device status of medium error in the above example.
Medium errors are target-based errors and are not retriable.  Namely, if the disk media is bad
then there is no chance that trying a different path will result in success -- if the
media is bad then the transport path is immaterial.

The odd thing is the DID_ERROR in this case.  The DID_ERROR is set internal to the 
driver upon detecting an anomaly within the target provided status information. 
For example, getting a scsi SUCCESS status, not having data underrun set BUT! having the
residual byte count be non-zero.  In this case the driver questions the validity of all
information given that the various components of the status don't jibe with one another.  The
status is essentially saying it was successful but didn't transfer all the data requested...
that is not the definition of success.

More/Next Steps:
Engage storage vendor.  A media error appears to be present and needs to be addressed at the
hardware level.  Typically the disk(s) will need to be physically replaced in this case.

Review the specific driver being used to see under what circumstances the DID_ERROR status
is set and returned to help better understand the circumstances of the reported error event.
The DID_ERROR is more of a curiosity than anything else.  A review of when DID_ERROR is set
within the qla2xxx and lpfc drivers found no cases where DID_ERROR was set AND the scsi status,
sense key, or asc/ascq codes were modified.  That is the scsi CHECK CONDITION, medium error
sense key (3h) and unrecoverable read error asc/ascq code (11h/00h) are all from the target
device. Ditto for the data phase error.

-------------------------------------------------------------------------------------------

The DID_ERROR is the only troubling thing in this case; it is clear that the device has a
media error and thus the disk needs to be physically replaced.  The DID_ERROR is likely an
artifact/secondary issue caused by the primary issue (whatever the check condition is for).

 


08.10.00.02 DRIVER_SENSE + TARGET_FAILURE + {SCSI} CHECK_CONDITION


 

Raw


0x08.10.00.02
           02   status byte : SAM_STAT_CHECK_CONDITION - check returned sense data, esp. 
                                                         ASC/ASCQ
        00         msg byte : {likely} not valid, see other fields
     10           host byte : {RHEL5/6} DID_TARGET_FAILURE - permanent target failure, do not
                                                             retry other paths {set via sense
                                                             info review}
  08            driver byte : DRIVER_SENSE {scsi sense buffer available from target}

===NOTES===================================================================================

    o 08.10.00.02 DRIVER_SENSE + DID_TARGET_FAILURE + {SCSI} CHECK_CONDITION 

Example:

kernel: sd 6:0:2:0: SCSI error: return code = 0x08100002
kernel: Result: hostbyte=invalid driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sde: Current: sense key: Medium Error
kernel:     Add. Sense: Record not found

Summary:
The IO is failing with sense information that flags the device as in a permanent error state.

Description:
One of the following sense key and asc/ascq combos is returned by the target, causing
the DID_TARGET_FAILURE host byte to be set.  The hostbyte decode table within constants.c
is currently missing DID_TARGET_FAILURE, which is why you might see output in messages of
"hostbyte=invalid" as in the example above.  A BZ has been opened against that issue.

A DID_TARGET_FAILURE means the current error is considered a permanent target failure and
no other retries on other paths will be attempted.

DID_TARGET_FAILURE is set under the following circumstances:

1.  Only certain scsi sense keys are processed by the scsi stack.  If any of the 
    following scsi sense keys are returned, then the device is considered dead
    and a hostbyte of DID_TARGET_FAILURE is set.

        . key 7h, DATA PROTECT.
        . key 8h, BLANK CHECK.
        . key Ah, COPY ABORTED.
        . key Dh, VOLUME OVERFLOW.
        . key Eh, MISCOMPARE.

        From the SCSI specification, these sense keys are described as follows.

        Sense Key
        7h           DATA PROTECT.  Indicates that a command that reads or writes the medium was
                     attempted on a block that is protected from this operation.  The read or
                     write operation is not performed.

        8h           BLANK CHECK. Indicates that a write-once device or a sequential access
                     device encountered blank medium or format-defined end-of-data indication
                     while reading or a write-once device encountered a non-blank medium while
                     writing.

        Ah           COPY ABORTED.  Indicates a COPY, COMPARE, or COPY AND VERIFY command was
                     aborted due to an error condition on the source device, destination
                     device, or both.

        Dh           VOLUME OVERFLOW.  Indicates that a buffered peripheral device has reached
                     the end-of-partition and data may remain in the buffer that has not been
                     written to the medium.  A RECOVER BUFFERED DATA command(s) may be issued
                     to read the unwritten data from the buffer.

        Eh           MISCOMPARE.  Indicates that the source data did not match the data read
                     from the medium.

2. The sense key is MEDIUM ERROR (3h) plus any of the following additional sense code (asc) is
   set within the sense buffer:

        . 0x11/xx - Unrecovered read error
        . 0x13/xx - AMNF data field
        . 0x14/xx - record not found {the specific code within the example above} 

3. Set if HARDWARE ERROR sense key (4h) and there is no retries allowed for hardware
   errors per the scmd->device->retry_hwerror counter.


More/Next Steps:
Engage storage vendor.  The root cause is either an odd sense key being returned from the
target (see #1 above), a medium error/asc combo per #2, or a hardware error with no retries
allowed on hardware errors for this device (which is typical) per #3.

Typically the cause of the odd sense, media error or hardware error from the device(s) will
need to be addressed.

Recommendations:
Engage storage h/w support.

 

References

See http://tldp.org/HOWTO/archived/SCSI-Programming-HOWTO/SCSI-Programming-HOWTO-21.html for additional background information.

 

Resources

http://osdir.com/ml/scsi/2003-01/msg00364.html
