9211 (LSI/SAS) issues on 11.2-STABLE


9211 (LSI/SAS) issues on 11.2-STABLE

Karl Denninger
I recently started having some really oddball things happening under
stress.  This coincided with the machine being updated to 11.2-STABLE
(FreeBSD 11.2-STABLE #1 r342918) from 11.1.

Specifically, I get "errors" like this:

        (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
length 131072 SMID 269 Aborting command 0xfffffe0001179110
mps0: Sending reset from mpssas_send_abort for target ID 37
        (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
c xfer 0
        (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
c xfer 0
mps0: Unfreezing devq for target ID 37
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: Command timeout
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: SCSI Status Error
(da12:mps0:0:37:0): SCSI status: Check Condition
(da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
reset, or bus device reset occurred)
(da12:mps0:0:37:0): Retrying command (per sense data)

The "Unit Attention" implies the drive was reset.  It only occurs on
certain drives under very heavy load (e.g. a scrub.)  I've managed to
provoke it on two different brands of disk across multiple firmware
versions and capacities, however, which tends to point away from a
drive firmware problem.

A look at the pool data shows /no/ errors (e.g. no checksum problems,
etc.) and a look at the disk itself (using smartctl) shows no problems
either -- whatever is going on here, the adapter is recovering from it
without any data corruption or loss registered on *either end*!
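
For anyone who wants to check the same things on their own box, the
two checks above amount to something like this (pool and device names
here are from my system; substitute your own):

```shell
# Pool-level view: any READ/WRITE/CKSUM errors against the vdevs?
zpool status -v backup

# Drive-level view: -x dumps everything smartctl knows, including
# the Device Statistics (GP Log 0x04) pages
smartctl -x /dev/da12

# Or just the device-statistics log by itself
smartctl -l devstat /dev/da12
```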

The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:

mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

There is also a SAS expander connected to that with all but the boot
drives on it (the LSI card will not boot from the expander so the boot
mirror is directly connected to the adapter.)

Thinking this might be a firmware/driver compatibility problem, I
flashed the card to 20.00.07.00, which is the latest available.  That
made the situation **MUCH** worse; instead of getting unit attention
issues I got *controller* resets (!!), which invariably caused some
random device (and sometimes more than one) in one of the pools to be
detached, as the controller didn't come back up fast enough for ZFS
and it declared the device(s) in question "removed".

Needless to say I immediately flashed the card back to 19.00.00.00!

This configuration has been completely stable on 11.1 for upwards of a
year, and only started misbehaving when I updated the OS to 11.2.  I've
pounded the living daylights out of this box for a very long time on a
succession of FreeBSD OS builds, and up to 11.1 I never saw anything
like this; when I had a bad drive, it was clearly the drive.

Looking at the commit logs for the mps driver it appears there isn't
much here that *could* be involved, unless there's an interrupt issue
with some of the MSI changes that is interacting with my specific
motherboard line.

Any ideas on running this down would be appreciated.  It's not easy to
trigger on the 19.00.00.00 firmware, but on 20.00.07.00 I can force a
controller reset and detach within a few minutes by running scrubs, so
if there are things I can try (I have a sandbox machine with the same
hardware in it that won't make me cry much if I blow it up) that would
be great.
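
For concreteness, the reproduction is roughly this (again, pool and
device names are specific to my setup):

```shell
# Kick off a scrub on the pool that sits behind the expander...
zpool scrub backup

# ...then watch the console log for mps0 aborts/resets as it runs
tail -f /var/log/messages | grep -E 'mps0|da1[12]'
```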

Thanks!

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

Karl Denninger

On 2/2/2019 12:02, Karl Denninger wrote:

After considerable additional work this looks increasingly like either a
missed interrupt or a command is getting lost between the host adapter
and the expander.

I'm going to turn the driver debug level up and see if I can capture
more information.  Whatever is behind this, however, is almost
certainly related to something that changed between 11.1 and 11.2, as
I never saw these on the 11.1-STABLE build.
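
For anyone following along, turning the debug level up is just a
sysctl; the mask value below is only an illustration -- see mps(4)
for what each bit actually enables:

```shell
# What is it set to now?
sysctl dev.mps.0.debug_level

# Turn on more verbose logging; the value is a bitmask whose bits
# are documented in mps(4) -- pick the message classes you care about
sysctl dev.mps.0.debug_level=0x1f

# It can also be set at boot from /boot/loader.conf:
#   hw.mps.0.debug_level="0x1f"
```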

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

Karl Denninger
On 2/5/2019 09:22, Karl Denninger wrote:

Pretty decent trace here -- any ideas?

mps0: timedout cm 0xfffffe00011b5020 allocated tm 0xfffffe00011812a0
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 Aborting command 0xfffffe00011b5020
mps0: Sending reset from mpssas_send_abort for target ID 37
mps0: queued timedout cm 0xfffffe00011c2760 for processing by tm
0xfffffe00011812a0
mps0: queued timedout cm 0xfffffe00011a74f0 for processing by tm
0xfffffe00011812a0
mps0: queued timedout cm 0xfffffe00011cfd50 for processing by tm
0xfffffe00011812a0
mps0: EventReply        :
        EventDataLength: 2
        AckRequired: 0
        Event: SasDiscovery (0x16)
        EventContext: 0x0
        Flags: 1<InProgress>
        ReasonCode: Discovery Started
        PhysicalPort: 0
        DiscoveryStatus: 0
mps0: (0)->(mpssas_fw_work) Working on  Event: [16]
mps0: (1)->(mpssas_fw_work) Event Free: [16]
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
length 131072 SMID 961 completed timedout cm 0xfffffe00011cfd50 ccb
0xfffff8019458e000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
131072 SMID 961 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 completed timedout cm
0xfffffe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
01 00 00
020 ccb 0xfffff80056fb5000 during recovery ioc 8048 scsi 0 state
c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
length 131072 SMID 798 completed timedout cm 0xfffffe00011c2760 ccb
0xfffff80054e86000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 length
131072 SMID 798 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
length 131072 SMID 467 completed timedout cm 0xfffffe00011a74f0 ccb
0xfffff8011efe3000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 length
131072 SMID 467 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (xpt0:mps0:0:37:0): SMID 2 abort TaskMID 634 status 0x0 code 0x0
count 4
        (xpt0:mps0:0:37:0): SMID 2 finished recovery after aborting
TaskMID 634
mps0: Unfreezing devq for target ID 37
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: Command timeout
(da11:mps0:0:37:0): Retrying command
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
mps0: EventReply        :
        EventDataLength: 4
        AckRequired: 0
        Event: SasTopologyChangeList (0x1c)
        EventContext: 0x0
        EnclosureHandle: 0x2
        ExpanderDevHandle: 0x9
        NumPhys: 25
        NumEntries: 1
        StartPhyNum: 12
        ExpStatus: Responding (0x3)
        PhysicalPort: 0
        PHY[12].AttachedDevHandle: 0x0013
        PHY[12].LinkRate: 6.0Gbps (0xaa)
        PHY[12].PhyStatus: PHYLinkStatusChange
mps0: (1)->(mpssas_fw_work) Working on  Event: [1c]
mps0: (2)->(mpssas_fw_work) Event Free: [1c]
mps0: EventReply        :
        EventDataLength: 2
        AckRequired: 0
        Event: SasDiscovery (0x16)
        EventContext: 0x0
        Flags: 0
        ReasonCode: Discovery Complete
        PhysicalPort: 0
        DiscoveryStatus: 0
mps0: (2)->(mpssas_fw_work) Working on  Event: [16]
mps0: (3)->(mpssas_fw_work) Event Free: [16]
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
(da11:mps0:0:37:0): CAM status: SCSI Status Error
(da11:mps0:0:37:0): SCSI status: Check Condition
(da11:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
reset, or bus device reset occurred)
(da11:mps0:0:37:0): Retrying command (per sense data)

This is what's running right now that produced that -- note no errors
reported on the two providers (the degraded state is intentional; the
third unit is off-site in a bank vault):

[karl@NewFS /var/log]$ zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub in progress since Tue Feb  5 06:57:14 2019
        840G scanned at 200M/s, 517G issued at 123M/s, 3.36T total
        0 repaired, 15.04% done, 0 days 06:44:17 to go
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0     0
            13282812295755460479  OFFLINE      0     0     0  was
/dev/gpt/backup62-2.eli

errors: No known data errors

But, here's the SMART data on the disk -- it DOES show the reset:

Before....

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              19  ---  Lifetime Power-On Resets
0x01  0x018  6     32002138534  ---  Logical Sectors Written
0x01  0x020  6       155895325  ---  Number of Write Commands
0x01  0x028  6    265457016500  ---  Logical Sectors Read
0x01  0x030  6      1132601490  ---  Number of Read Commands
0x01  0x038  6     61537515300  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
0x03  0x010  4            6114  ---  Head Flying Hours
0x03  0x018  4           30602  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4          283735  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              38  ---  Current Temperature
0x05  0x010  1              40  N--  Average Short Term Temperature
0x05  0x018  1              32  N--  Average Long Term Temperature
0x05  0x020  1              52  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              48  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              44  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               6  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

And after....

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              19  ---  Lifetime Power-On Resets
0x01  0x018  6     32004195070  ---  Logical Sectors Written
0x01  0x020  6       155946732  ---  Number of Write Commands
0x01  0x028  6    266340052468  ---  Logical Sectors Read
0x01  0x030  6      1136347463  ---  Number of Read Commands
0x01  0x038  6     61540814700  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
0x03  0x010  4            6114  ---  Head Flying Hours
0x03  0x018  4           30602  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4          283848  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              41  ---  Current Temperature
0x05  0x010  1              40  N--  Average Short Term Temperature
0x05  0x018  1              32  N--  Average Long Term Temperature
0x05  0x020  1              52  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              48  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              44  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               7  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Number of Hardware Resets has incremented.  There are no other errors shown:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   137   137   054    -    104
  3 Spin_Up_Time            POS---   132   132   024    -    497
(Average 497)
  4 Start_Stop_Count        -O--C-   099   099   000    -    7074
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   098   098   000    -    17094
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    19
192 Power-Off_Retract_Count -O--CK   075   075   000    -    30602
193 Load_Cycle_Count        -O--C-   075   075   000    -    30602
194 Temperature_Celsius     -O----   146   146   000    -    41 (Min/Max
23/52)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

I'd throw possible shade at the backplane or cable, /but I have already
swapped both out for spares without any change in behavior./

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

Karl Denninger
BTW, under 12.0-STABLE (built this afternoon after the advisories came
out, with the patches) it's MUCH worse.  I get the same device resets,
BUT they're followed by an immediate panic which I cannot dump, as it
generates a page fault (supervisor read data, page not present) in the
mps *driver* itself at mpssas_send_abort+0x21.

This precludes a dump, of course, since attempting one gives you a
double panic (I was wondering why I didn't get a crash dump!); I'll
re-jigger the box to put a dump device on an internal SATA drive so I
can successfully get the dump when it happens and see if I can obtain
a proper crash dump on this.
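
The re-jiggering is nothing exotic -- roughly the following, where
ada0p3 is a stand-in for whatever swap/dump partition ends up on the
internal SATA disk:

```shell
# Point crash dumps at a partition on the internal SATA drive, so
# the dump path doesn't go through the (panicking) mps-attached disks
dumpon /dev/ada0p3

# Make it stick across reboots (writes dumpdev to /etc/rc.conf)
sysrc dumpdev="/dev/ada0p3"

# After the next panic, savecore(8) runs from rc at boot and pulls
# the dump into /var/crash for analysis
```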

I think it's fair to assume that 12.0-STABLE should not panic on a
disk problem (unless, of course, the problem is trying to page
something back in -- it's not; the drive that aborts and resets is in
a data pack doing a scrub.)

On 2/5/2019 10:26, Karl Denninger wrote:

> On 2/5/2019 09:22, Karl Denninger wrote:
>> On 2/2/2019 12:02, Karl Denninger wrote:
>>> I recently started having some really oddball things  happening under
>>> stress.  This coincided with the machine being updated to 11.2-STABLE
>>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>>
>>> Specifically, I get "errors" like this:
>>>
>>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> length 131072 SMID 269 Aborting command 0xfffffe0001179110
>>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
>>> c xfer 0
>>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
>>> c xfer 0
>>> mps0: Unfreezing devq for target ID 37
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: Command timeout
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>>> (da12:mps0:0:37:0): SCSI status: Check Condition
>>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>>> reset, or bus device reset occurred)
>>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>>
>>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>>> on two different brands of disk across multiple firmware and capacities,
>>> however, which tends to point away from a drive firmware problem.
>>>
>>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>>> etc) and a look at the disk itself (using smartctl) shows no problems
>>> either -- whatever is going on here the adapter is recovering from it
>>> without any data corruption or loss registered on *either end*!
>>>
>>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>>
>>> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>> After considerable additional work this looks increasingly like either a
>> missed interrupt or a command is getting lost between the host adapter
>> and the expander.
>>
>> I'm going to turn the driver debug level up and see if I can capture
>> more information..... whatever is behind this, however, it is
>> almost-certainly related to something that changed between 11.1 and
>> 11.2, as I never saw these on the 11.1-STABLE build.
>>
>> --
>> Karl Denninger
>> [hidden email] <mailto:[hidden email]>
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
> Pretty decent trace here -- any ideas?
>
> mps0: timedout cm 0xfffffe00011b5020 allocated tm 0xfffffe00011812a0
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> length 131072 SMID 634 Aborting command 0xfffffe00011b5020
> mps0: Sending reset from mpssas_send_abort for target ID 37
> mps0: queued timedout cm 0xfffffe00011c2760 for processing by tm
> 0xfffffe00011812a0
> mps0: queued timedout cm 0xfffffe00011a74f0 for processing by tm
> 0xfffffe00011812a0
> mps0: queued timedout cm 0xfffffe00011cfd50 for processing by tm
> 0xfffffe00011812a0
> mps0: EventReply        :
>         EventDataLength: 2
>         AckRequired: 0
>         Event: SasDiscovery (0x16)
>         EventContext: 0x0
>         Flags: 1<InProgress>
>         ReasonCode: Discovery Started
>         PhysicalPort: 0
>         DiscoveryStatus: 0
> mps0: (0)->(mpssas_fw_work) Working on  Event: [16]
> mps0: (1)->(mpssas_fw_work) Event Free: [16]
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
> length 131072 SMID 961 completed timedout cm 0xfffffe00011cfd50 ccb
> 0xfffff8019458e000 during recovery ioc 804b scsi 0 state c
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
> 131072 SMID 961 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> length 131072 SMID 634 completed timedout cm
> 0xfffffe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
> 01 00 00
> 020 ccb 0xfffff80056fb5000 during recovery ioc 8048 scsi 0 state
> c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
> (da11:mps0:0:37:0): Retrying command
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
> length 131072 SMID 798 completed timedout cm 0xfffffe00011c2760 ccb
> 0xfffff80054e86000 during recovery ioc 804b scsi 0 state c
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 length
> 131072 SMID 798 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
> length 131072 SMID 467 completed timedout cm 0xfffffe00011a74f0 ccb
> 0xfffff8011efe3000 during recovery ioc 804b scsi 0 state c
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 length
> 131072 SMID 467 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
>         (xpt0:mps0:0:37:0): SMID 2 abort TaskMID 634 status 0x0 code 0x0
> count 4
>         (xpt0:mps0:0:37:0): SMID 2 finished recovery after aborting
> TaskMID 634
> mps0: Unfreezing devq for target ID 37
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: Command timeout
> (da11:mps0:0:37:0): Retrying command
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: CCB request completed with an error
> (da11:mps0:0:37:0): Retrying command
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: CCB request completed with an error
> (da11:mps0:0:37:0): Retrying command
> mps0: EventReply        :
>         EventDataLength: 4
>         AckRequired: 0
>         Event: SasTopologyChangeList (0x1c)
>         EventContext: 0x0
>         EnclosureHandle: 0x2
>         ExpanderDevHandle: 0x9
>         NumPhys: 25
>         NumEntries: 1
>         StartPhyNum: 12
>         ExpStatus: Responding (0x3)
>         PhysicalPort: 0
>         PHY[12].AttachedDevHandle: 0x0013
>         PHY[12].LinkRate: 6.0Gbps (0xaa)
>         PHY[12].PhyStatus: PHYLinkStatusChange
> mps0: (1)->(mpssas_fw_work) Working on  Event: [1c]
> mps0: (2)->(mpssas_fw_work) Event Free: [1c]
> mps0: EventReply        :
>         EventDataLength: 2
>         AckRequired: 0
>         Event: SasDiscovery (0x16)
>         EventContext: 0x0
>         Flags: 0
>         ReasonCode: Discovery Complete
>         PhysicalPort: 0
>         DiscoveryStatus: 0
> mps0: (2)->(mpssas_fw_work) Working on  Event: [16]
> mps0: (3)->(mpssas_fw_work) Event Free: [16]
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: SCSI Status Error
> (da11:mps0:0:37:0): SCSI status: Check Condition
> (da11:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da11:mps0:0:37:0): Retrying command (per sense data)
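[For picking sequences like the one above apart after the fact, a small filter over the kernel log can line up the mps0 abort/reset/event messages with the per-drive CAM retries in one stream. A rough sketch; the patterns simply match the messages quoted in this thread:]

```shell
# Rough sketch: filter kernel messages so that mps0 reset/unfreeze/event
# lines and the CAM retry lines appear together, in order.
# Patterns are taken from the log excerpts quoted above.
mps_timeline() {
    grep -E 'mps0: (Sending reset|Unfreezing|EventReply)|CAM status|Retrying command'
}

# Typical use on a live box (dmesg feeds the filter):
#   dmesg | mps_timeline | tail -40
```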
>
> This is what's running right now that produced that -- note no errors
> reported on the two providers (the degraded state is intentional; the
> third unit is off-site in a bank vault):
>
> [karl@NewFS /var/log]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub in progress since Tue Feb  5 06:57:14 2019
>         840G scanned at 200M/s, 517G issued at 123M/s, 3.36T total
>         0 repaired, 15.04% done, 0 days 06:44:17 to go
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0     0
>             13282812295755460479  OFFLINE      0     0     0  was
> /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> But, here's the SMART data on the disk -- it DOES show the reset:
>
> Before....
>
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 2) ==
> 0x01  0x008  4              19  ---  Lifetime Power-On Resets
> 0x01  0x018  6     32002138534  ---  Logical Sectors Written
> 0x01  0x020  6       155895325  ---  Number of Write Commands
> 0x01  0x028  6    265457016500  ---  Logical Sectors Read
> 0x01  0x030  6      1132601490  ---  Number of Read Commands
> 0x01  0x038  6     61537515300  ---  Date and Time TimeStamp
> 0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
> 0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
> 0x03  0x010  4            6114  ---  Head Flying Hours
> 0x03  0x018  4           30602  ---  Head Load Events
> 0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
> 0x03  0x028  4          283735  ---  Read Recovery Attempts
> 0x03  0x030  4               0  ---  Number of Mechanical Start Failures
> 0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
> 0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
> 0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
> Completion
> 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
> 0x05  0x008  1              38  ---  Current Temperature
> 0x05  0x010  1              40  N--  Average Short Term Temperature
> 0x05  0x018  1              32  N--  Average Long Term Temperature
> 0x05  0x020  1              52  ---  Highest Temperature
> 0x05  0x028  1              23  ---  Lowest Temperature
> 0x05  0x030  1              48  N--  Highest Average Short Term Temperature
> 0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
> 0x05  0x040  1              44  N--  Highest Average Long Term Temperature
> 0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
> 0x05  0x050  4               0  ---  Time in Over-Temperature
> 0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
> 0x05  0x060  4               0  ---  Time in Under-Temperature
> 0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
> 0x06  0x008  4               6  ---  Number of Hardware Resets
> 0x06  0x010  4               0  ---  Number of ASR Events
> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>                                 |||_ C monitored condition met
>                                 ||__ D supports DSN
>                                 |___ N normalized value
>
> And after....
>
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 2) ==
> 0x01  0x008  4              19  ---  Lifetime Power-On Resets
> 0x01  0x018  6     32004195070  ---  Logical Sectors Written
> 0x01  0x020  6       155946732  ---  Number of Write Commands
> 0x01  0x028  6    266340052468  ---  Logical Sectors Read
> 0x01  0x030  6      1136347463  ---  Number of Read Commands
> 0x01  0x038  6     61540814700  ---  Date and Time TimeStamp
> 0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
> 0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
> 0x03  0x010  4            6114  ---  Head Flying Hours
> 0x03  0x018  4           30602  ---  Head Load Events
> 0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
> 0x03  0x028  4          283848  ---  Read Recovery Attempts
> 0x03  0x030  4               0  ---  Number of Mechanical Start Failures
> 0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
> 0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
> 0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
> Completion
> 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
> 0x05  0x008  1              41  ---  Current Temperature
> 0x05  0x010  1              40  N--  Average Short Term Temperature
> 0x05  0x018  1              32  N--  Average Long Term Temperature
> 0x05  0x020  1              52  ---  Highest Temperature
> 0x05  0x028  1              23  ---  Lowest Temperature
> 0x05  0x030  1              48  N--  Highest Average Short Term Temperature
> 0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
> 0x05  0x040  1              44  N--  Highest Average Long Term Temperature
> 0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
> 0x05  0x050  4               0  ---  Time in Over-Temperature
> 0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
> 0x05  0x060  4               0  ---  Time in Under-Temperature
> 0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
> 0x06  0x008  4               7  ---  Number of Hardware Resets
> 0x06  0x010  4               0  ---  Number of ASR Events
> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>                                 |||_ C monitored condition met
>                                 ||__ D supports DSN
>                                 |___ N normalized value
>
> Number of Hardware Resets has incremented.  There are no other errors shown:
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   137   137   054    -    104
>   3 Spin_Up_Time            POS---   132   132   024    -    497
> (Average 497)
>   4 Start_Stop_Count        -O--C-   099   099   000    -    7074
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   128   128   020    -    18
>   9 Power_On_Hours          -O--C-   098   098   000    -    17094
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    19
> 192 Power-Off_Retract_Count -O--CK   075   075   000    -    30602
> 193 Load_Cycle_Count        -O--C-   075   075   000    -    30602
> 194 Temperature_Celsius     -O----   146   146   000    -    41 (Min/Max
> 23/52)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
>
> I'd throw possible shade at the backplane or cable /but I have already
> swapped both out for spares without any change in behavior./
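[To watch that counter across a scrub without eyeballing the full devstat dump each time, something like the following works. A sketch, assuming smartmontools' `smartctl -l devstat` output is laid out as shown above; `parse_hw_resets` is just an illustrative name:]

```shell
# Sketch: extract the "Number of Hardware Resets" counter from
# `smartctl -l devstat` output on stdin.  Field positions match the
# GP Log 0x04 dumps quoted above ($4 is the Value column).
parse_hw_resets() {
    awk '/Number of Hardware Resets/ {print $4; exit}'
}

# Typical use against a live drive (da11 as in this thread):
#   before=$(smartctl -l devstat /dev/da11 | parse_hw_resets)
#   ... run the scrub ...
#   after=$(smartctl -l devstat /dev/da11 | parse_hw_resets)
#   echo "hardware resets during scrub: $((after - before))"
```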
>
--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

Borja Marcos-2


> On 5 Feb 2019, at 23:49, Karl Denninger <[hidden email]> wrote:
>
> BTW under 12.0-STABLE (built this afternoon after the advisories came
> out, with the patches) it's MUCH worse.  I get the same device resets
> BUT it's followed by an immediate panic which I cannot dump as it
> generates a page-fault (supervisor read data, page not present) in the
> mps *driver* at mpssas_send_abort+0x21.

> This precludes a dump of course since attempting to do so gives you a
> double-panic (I was wondering why I didn't get a crash dump!); I'll
> re-jigger the box to stick a dump device on an internal SATA device so I
> can successfully get the dump when it happens and see if I can obtain a
> proper crash dump on this.
>
> I think it's fair to assume that 12.0-STABLE should not panic on a disk
> problem (unless of course the problem is trying to page something back
> in -- it's not, the drive that aborts and resets is on a data pack doing
> a scrub)

It shouldn’t panic, I imagine.

>>>> mps0: Sending reset from mpssas_send_abort for target ID 37


>> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
>> 0x06  0x008  4               6  ---  Number of Hardware Resets
>> 0x06  0x010  4               0  ---  Number of ASR Events
>> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>>                                 |||_ C monitored condition met
>>                                 ||__ D supports DSN
>>                                 |___ N normalized value
>>
>> 0x06  0x008  4               7  ---  Number of Hardware Resets
>> 0x06  0x010  4               0  ---  Number of ASR Events
>> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>>                                 |||_ C monitored condition met
>>                                 ||__ D supports DSN
>>                                 |___ N normalized value
>>
>> Number of Hardware Resets has incremented.  There are no other errors shown:

What _exactly_ is that value? Does it count resets sent from the HBA,
_or_ the device resetting by itself?

>> I'd throw possible shade at the backplane or cable /but I have already
>> swapped both out for spares without any change in behavior./

What about the power supply?





Borja.


_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[hidden email]"

Re: 9211 (LSI/SAS) issues on 11.2-STABLE

Karl Denninger
On 2/6/2019 09:18, Borja Marcos wrote:

>> On 5 Feb 2019, at 23:49, Karl Denninger <[hidden email]> wrote:
>>
>> BTW under 12.0-STABLE (built this afternoon after the advisories came
>> out, with the patches) it's MUCH worse.  I get the same device resets
>> BUT it's followed by an immediate panic which I cannot dump as it
>> generates a page-fault (supervisor read data, page not present) in the
>> mps *driver* at mpssas_send_abort+0x21.
>> This precludes a dump of course since attempting to do so gives you a
>> double-panic (I was wondering why I didn't get a crash dump!); I'll
>> re-jigger the box to stick a dump device on an internal SATA device so I
>> can successfully get the dump when it happens and see if I can obtain a
>> proper crash dump on this.
>>
>> I think it's fair to assume that 12.0-STABLE should not panic on a disk
>> problem (unless of course the problem is trying to page something back
>> in -- it's not, the drive that aborts and resets is on a data pack doing
>> a scrub)
> It shouldn’t panic I imagine.
>
>>>>> mps0: Sending reset from mpssas_send_abort for target ID 37
>
>>> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
>>> 0x06  0x008  4               6  ---  Number of Hardware Resets
>>> 0x06  0x010  4               0  ---  Number of ASR Events
>>> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>>>                                 |||_ C monitored condition met
>>>                                 ||__ D supports DSN
>>>                                 |___ N normalized value
>>>
>>> 0x06  0x008  4               7  ---  Number of Hardware Resets
>>> 0x06  0x010  4               0  ---  Number of ASR Events
>>> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>>>                                 |||_ C monitored condition met
>>>                                 ||__ D supports DSN
>>>                                 |___ N normalized value
>>>
>>> Number of Hardware Resets has incremented.  There are no other errors shown:
> What is _exactly_ that value? Is it related to the number of resets sent from the HBA
> _or_ the device resetting by itself?
Good question.  What counts as a "reset"?  A UNIT ATTENTION is what the
controller receives, but whether it stems from a power reset, a reset
*command* from the HBA, or a firmware crash (yikes!) inside the disk,
I'm not certain.
>>> I'd throw possible shade at the backplane or cable /but I have already
>>> swapped both out for spares without any change in behavior./
> What about the power supply?
>
There are multiple other devices and the system board on that supply
(and thus voltage rails) but it too has been swapped out without
change.  In fact at this point other than the system board and RAM
(which is ECC, and is showing no errors in the system's BMC log)
/everything/ in the server case (HBA, SAS expander, backplane, power
supply and cables) has been swapped for spares.  No change in behavior.

Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
I get a *controller* reset (!!!) which detaches some random number of
devices from ZFS's point of view before it comes back up (depending on
what's active at the time) which is potentially catastrophic if it hits
the system pool.  I immediately went back to 19.0.0.0 firmware on the
HBA.  I had upgraded to 20.0.7.0 when I first saw this, since there had
been good reports of its stability, thinking a drive change might have
introduced issues when running 19.0 firmware on the card.

This system was completely stable for over a year on 11.1-STABLE and in
fact hadn't been rebooted or logged a single "event" in over six months;
the problems started immediately upon upgrade to 11.2-STABLE and
persist on 12.0-STABLE.  The disks in question haven't changed either
(so it can't be a difference in firmware that is in a newer purchased
disk, for example.)

I'm thinking perhaps *something* in the codebase change made the HBA ->
SAS expander combination troublesome where it wasn't before; I've got a
couple of 16i HBAs on the way which will allow me to remove the SAS
expander to see if that causes the problem to disappear.  I've got a
bunch of these Lenovo expanders and have been using them without any
sort of trouble in multiple machines; it's only when I went beyond 11.1
that I started having trouble with them.

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

Borja Marcos-2


> On 6 Feb 2019, at 16:34, Karl Denninger <[hidden email]> wrote:
>
> On 2/6/2019 09:18, Borja Marcos wrote:
>>>> Number of Hardware Resets has incremented.  There are no other errors shown:
>> What is _exactly_ that value? Is it related to the number of resets sent from the HBA
>> _or_ the device resetting by itself?
> Good question.  What counts as a "reset"; UNIT ATTENTION is what the
> controller receives but whether that's a power reset, a reset *command*
> from the HBA or a firmware crash (yikes!) in the disk I'm not certain.

In my youth I wrote software for tape drives. After a reset, no matter how it
was initiated (by the device itself or by the HBA), the device will give you a
UNIT ATTENTION, if I remember correctly (that was 25 years ago).


>>>> I'd throw possible shade at the backplane or cable /but I have already
>>>> swapped both out for spares without any change in behavior./
>> What about the power supply?
>>
> There are multiple other devices and the system board on that supply
> (and thus voltage rails) but it too has been swapped out without
> change.  In fact at this point other than the system board and RAM
> (which is ECC, and is showing no errors in the system's BMC log)
> /everything/ in the server case (HBA, SAS expander, backplane, power
> supply and cables) has been swapped for spares.  No change in behavior.
>
> Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
> I get a *controller* reset (!!!) which detaches some random number of
> devices from ZFS's point of view before it comes back up (depending on
> what's active at the time) which is potentially catastrophic if it hits
> the system pool.  I immediately went back to 19.0.0.0 firmware on the
> HBA; I had upgraded to 20.0.7.0 since there had been good reports of
> stability with it when I first saw this, thinking there was a drive
> change that might have resulted in issues with it when running 19.0
> firmware on the card.

I have a system running 12.0-RELEASE-p1 with a LSI2008, 15 SAS disks and a SATA SSD
and I haven’t seen any problems. This is heavily loaded with just 8 GB of memory and a lot
of stuff running.

mps0: <Avago Technologies (LSI) SAS2008> port 0x9000-0x90ff mem 0xdfff0000-0xdff
fffff,0xdff80000-0xdffbffff irq 17 at device 0.0 numa-domain 0 on pci4
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

ses0:
        Enclosure Name: LSILOGIC SASX28 A.0 5021

> This system was completely stable for over a year on 11.1-STABLE and in
> fact hadn't been rebooted or logged a single "event" in over six months;
> the problems started immediately upon upgrade to 11.2-STABLE and
> persists on 12.0-STABLE.  The disks in question haven't changed either
> (so it can't be a difference in firmware that is in a newer purchased
> disk, for example.)

But you are right, a panic because of a disk problem points to a bug. As long as the ZFS
pool is usable, trouble with one of its disks should just be logged. Unless of course
the disk is used for swap, or the disk failure leaves the system unable to
complete a page-in. Again, it shouldn’t happen.

> I'm thinking perhaps *something* in the codebase change made the HBA ->
> SAS Expander combination trouble where it wasn't before; I've got a
> couple of 16i HBAs on the way which will allow me to remove the SAS
> expander to see if that causes the problem to disappear.  I've got a
> bunch of these Lenovo expanders and have been using them without any
> sort of trouble in multiple machines; it's only when I went beyond 11.1
> that I started having trouble with them.

It might be some backplane misbehavior triggering a bug, complicated.





Borja.


