AOC-USAS2-L8i zfs panics and SCSI errors in messages

Karli Sjöberg-2
Hi,

I'm in the process of migrating from a Sun/Oracle system to a Supermicro/FreeBSD system, doing zfs send/recv between them. Twice now, the system has panicked while idle, and it throws a lot of SCSI/CAM-related errors during I/O-intensive operations such as send/recv and resilver; zpool has also sometimes reported read/write errors on the hard drives. The best part is that the errors in messages concern every hard drive at one time or another, even though the drives are connected through separate cables, controllers and caddies. Specs:

HW:
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3,2GHz
4x  2GB 1333MHZ DDR3 ECC UDIMM
10x SAMSUNG HD204UI (in a raidz2 zpool)
1x  OCZ Vertex 3 240GB (L2ARC)

SW:
# uname -a
FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
# zpool get version pool1
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default

I got the panic from the IPMI KVM:
http://i55.tinypic.com/synpzk.png


And an extract from /var/log/messages:
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(10). CDB: 2a 0 6 13 66 f 0 0 f 0
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(6). CDB: a 0 1 b2 2 0
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 859
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 495
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 725
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 722
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: (1:4:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:40:38 fs2-7 last message repeated 3 times
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 859 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 495
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 495 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 725
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 725 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 722
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 722 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 438 complete
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 6 25 4f 75 0 0 b 0
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 2d a5 10 ca 0 0 80 0
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:45:40 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 636
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 888
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 983
Oct 19 17:45:41 fs2-7 kernel: mps0: (0:1:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:45:41 fs2-7 last message repeated 2 times
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 976 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 636
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 636 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 888
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 888 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 983
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 983 complete
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 a7 2 0 0 3 0
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 b0 9 0 0 9 0
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
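
To back up the observation that the errors touch different drives on both controllers, it can help to tally the timeouts per device. The following is only a minimal sketch, assuming the log lines have exactly the format shown in the extract above; the function name count_timeouts is made up for illustration.

```shell
#!/bin/sh
# Count mps(4) "SCSI command timeout" lines per device in a syslog file.
# Usage: count_timeouts /var/log/messages
# (hypothetical helper; reads standard input when no file is given)
count_timeouts() {
    awk '
        /SCSI command timeout/ {
            # The device tuple looks like "(da9:mps1:0:4:0):"; keep "da9:mps1".
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\(da[0-9]+:mps[0-9]+:/) {
                    split(substr($i, 2), parts, ":")
                    count[parts[1] ":" parts[2]]++
                    break
                }
            }
        }
        END {
            # One line per device: "<disk>:<controller> <number of timeouts>"
            for (dev in count) print dev, count[dev]
        }
    ' "$@" | sort
}
```

Each output line pairs a disk and its controller with the number of timed-out commands, which makes it easier to see whether the timeouts cluster on one cable, caddy or controller.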

What's going on?

Regards
Karli Sjöberg
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
To unsubscribe, send any mail to "[hidden email]"

Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Martin Nilsson-3
On 2011-10-20 13:28, Karli Sjöberg wrote:
> 2x  Supermicro AOC-USAS2-L8i
>
How old is the firmware on these LSI2008 cards? The latest on LSI's website
is Phase 11, and Phases 9 and 10 are on the Supermicro FTP site.
These boards should be the same as an LSI 9211-8i, except that the
components and the bracket are on the wrong side.

/Martin

--
Martin Nilsson, CEO, Mullet Scandinavia AB, Malmö, SWEDEN
E-mail: [hidden email], Phone: +46-(0)708-59 99 91, Web: www.mullet.se

Our business is well engineered servers optimised for FreeBSD & Linux


Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Karli Sjöberg-2
I am using version 10.00.02.00 IT Firmware.

Update:

I have now experienced the exact same panic four times. It is identical every time, except that it happens on different CPUs. It always occurs at exactly 03:01, when the daily periodic scripts run. Each time I have tried sending the same filesystem over from the Oracle machine, and the transfer has always had time to finish properly. Last time it finished, and the machine then sat idle for about 6 hours and was working fine; I checked in at about 22:00 and looked at zpool status, which was clean. After restarting the machine, running periodic daily manually works. If I don't send any filesystems over, the machine is stable through the night, but once I've sent something over, it panics at 03:01.

I am going to try sending over another filesystem to see whether something in that specific filesystem causes the crash, or whether it happens regardless of which filesystem has been received.

/Karli Sjöberg

Best regards
-------------------------------------------------------------------------------
Karli Sjöberg
Swedish University of Agricultural Sciences
Box 7079 (Visiting Address Kronåsvägen 8)
S-750 07 Uppsala, Sweden
Phone:  +46-(0)18-67 15 66
[hidden email]


Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Ken Merry
On Thu, Oct 20, 2011 at 13:28:17 +0200, Karli Sjöberg wrote:

> I got the panic from the IPMI KVM:
> http://i55.tinypic.com/synpzk.png

Looking at the panic, this is a ZFS panic.  Nothing the disks do should
be able to cause ZFS to panic.  ZFS is panicking in avl_add():

        /*
         * This is unfortunate.  We want to call panic() here, even for
         * non-DEBUG kernels.  In userland, however, we can't depend on anything
         * in libc or else the rtld build process gets confused.  So, all we can
         * do in userland is resort to a normal ASSERT().
         */
        if (avl_find(tree, new_node, &where) != NULL)
#ifdef _KERNEL
                panic("avl_find() succeeded inside avl_add()");
#else
                ASSERT(0);
#endif

There are certainly timeouts and two terminated IOCs in the log below.  That
does suggest a hardware or driver problem, but it isn't very obvious what
it might be.

I have seen bad behavior with SATA drives behind 3Gb Maxim expanders
talking to 6Gb LSI controllers, but your particular configuration does not
involve any expanders, so this is not that particular STP issue.

My best guess, and it is a guess, is that either the drives are misbehaving
(e.g. a firmware problem) or you've got a cabling issue.

If you have more hardware available, you might try swapping out the cables
and/or drives to see if you can reproduce the drive errors with a
different setup.  If you swap the drives, I would use a different brand if
you've got them available.

I'm CCing the fs list, perhaps someone there can look at the stack trace
above and figure out what ZFS might be doing.

Again, ZFS should survive any errors from the drives, and the panic above
looks like ZFS is flagging a logic bug somewhere.

>
> And an extract from /var/log/messages:
> [...]

Ken
--
Kenneth Merry
[hidden email]

Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Karli Sjöberg-2
Hi all,

I tracked down what causes the panics!

I got a tip from aragon and phoenix at the forum about
/etc/periodic/security/100.chksetuid

And to put:
daily_status_security_chksetuid_enable="NO"
into /etc/periodic.conf
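
The workaround above amounts to one line in /etc/periodic.conf. As a rough sketch, the line can be added idempotently so that repeated runs do not duplicate it. The helper name ensure_setting is made up for illustration, and the sketch assumes no conflicting ="YES" line for the same variable already exists in the file.

```shell
#!/bin/sh
# Append a setting to a periodic.conf-style file unless that exact line
# is already present (hypothetical helper, not part of FreeBSD).
# Usage: ensure_setting 'daily_status_security_chksetuid_enable="NO"' /etc/periodic.conf
ensure_setting() {
    setting=$1
    conf=$2
    # -x matches the whole line, -F treats the setting as a fixed string.
    # A missing file counts as "not present", so the append creates it.
    if ! grep -qxF -- "$setting" "$conf" 2>/dev/null; then
        printf '%s\n' "$setting" >> "$conf"
    fi
}
```

After the line is in place, periodic daily can be run by hand, as described in this post, to confirm that the check is skipped.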

I can now run periodic daily without any panics. I'm still wondering about the cause, though. The explanation from the forum was that this phase is too demanding for multi-TB systems. But I have several multi-TB servers running FreeBSD and ZFS, and none of them has ever behaved this way. Besides, the panic is instantaneous, not degenerative. I would expect a run like that to start out OK and then get gradually slower and slower until the system couldn't cope any more and hung. This feels more like hitting a wall, as if it found something that it couldn't deal with and had no choice but to panic immediately.

I'm hoping this can be resolved without having to know beforehand about settings in periodic.conf that you couldn't have anticipated needing.

@Ken
The hard drives are connected to the caddies with two CABMS2FN05 breakout cables per controller, from:
http://www.promise.com/single_page_session/page.aspx?region=en-US&m=575&rsn=114

From controller 1 -> channel 1 -> ports 1,2,3 -> ports 1,2,3 in caddie 1
From controller 1 -> channel 2 -> ports 1,2 -> ports 4,5 in caddie 1

From controller 2 -> channel 1 -> ports 1,2,3 -> ports 1,2,3 in caddie 2
From controller 2 -> channel 2 -> ports 1,2 -> ports 4,5 in caddie 2

Is there any problem with that type of cabling?

These timeouts happen with all hard drives at one time or another; would that mean that all the cables are bad, or perhaps of poor quality? Regarding the drive firmware, all drives are running version 1AQ10001. I'm going to search for known problems with that, and if you know something, you are welcome to share ;)

Best Regards
Karli Sjöberg


Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Jeremy Chadwick
On Wed, Oct 26, 2011 at 11:36:44AM +0200, Karli Sjöberg wrote:

> Hi all,
>
> I tracked down what causes the panics!
>
> I got a tip from aragon and phoenix at the forum about
> /etc/periodic/security/100.chksetuid
>
> And to put:
> daily_status_security_chksetuid_enable="NO"
> into /etc/periodic.conf

This is not truly the cause of the panic, it simply exacerbates it.

Many of the periodic scripts will do things like iterate over all files
on the filesystem looking for specific attributes, etc.  This tends to
stress filesystems heavily.  This isn't the only one.  :-)

> I can now run periodic daily without any panics. I'm still wondering
> about the cause of this, the explanation from the forum was that that
> phase is too demanding for multi TB systems. But I have several multi
> TB servers with FreeBSD and ZFS, and none of them has ever behaved
> this way. Besides, the panic is instantaneous, not degenerative. I
> imagine that a run like that would start out OK and then just get
> worse and worse, getting gradually slower and slower until it just
> wouldn't cope any more and hang. This feels more like hitting a wall.
> As if it found something that it couldn't deal with and has no choice
> but to panic immediately.

It may be possible that you have some underlying filesystem corruption
that triggers this situation.  Have you actually tried doing a "zpool
scrub" of your pools and seeing if any errors happen or if the panic
occurs there?
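
The scrub check suggested above could be scripted roughly as follows. This is a sketch only: the function name pool_is_clean is made up, the pool name pool1 comes from the original post, and the check assumes that zpool status prints the line "errors: No known data errors" for a healthy pool. It does not look at the per-device READ/WRITE/CKSUM counters.

```shell
#!/bin/sh
# Succeeds if `zpool status` output on standard input reports no known
# data errors (hypothetical helper for illustration).
# Intended use, once the scrub has finished:
#   zpool scrub pool1        # starts the scrub; it runs in the background
#   zpool status -v pool1 | pool_is_clean && echo "no errors reported"
pool_is_clean() {
    grep -q '^errors: No known data errors$'
}
```

If the panic also occurs during the scrub itself, that would point at the pool contents or the I/O path rather than at the periodic script.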

I'm inclined to think what you're experiencing is probably a bug or
"quirk" in the storage controller driver you're using.  There are other
drivers that have had fixes applied to them "to make them work decently
with ZFS", meaning the kind of stressful I/O ZFS puts on them results in
the controller driver behaving oddly or freaking out, case in point.  It
could also be a controller firmware bug/quirk/design issue.  Seriously.

I believe the AOC-USAS2-L8i controller has been discussed on
freebsd-stable, re: mps(4) driver problems or equivalent, but I'm not
going to CC that list given that there would be 3 cross-posted lists
involved and that is liable to upset some folks.  You should search the
mailing lists for discussion of Supermicro controllers that work
reliably with FreeBSD.

It would be worthwhile to discuss this condition on -stable, mainly with
something like "Anyone else using the AOC-USAS2-L8i reliably with ZFS?"
You get the idea.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
To unsubscribe, send any mail to "[hidden email]"

Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Douglas Gilbert-2
On 11-10-26 06:16 AM, Jeremy Chadwick wrote:

> On Wed, Oct 26, 2011 at 11:36:44AM +0200, Karli Sjöberg wrote:
>> Hi all,
>>
>> I tracked down what causes the panics!
>>
>> I got a tip from aragon and phoenix at the forum about
>> /etc/periodic/security/100.chksetuid
>>
>> And to put:
>> daily_status_security_chksetuid_enable="NO"
>> into /etc/periodic.conf
>
> This is not truly the cause of the panic, it simply exacerbates it.
>
> Many of the periodic scripts will do things like iterate over all files
> on the filesystem looking for specific attributes, etc..  This tends to
> stress filesystems heavily.  This isn't the only one.  :-)
>
>> I can now run periodic daily without any panics. I'm still wondering
>> about the cause of this, the explanation from the forum was that that
>> phase is too demanding for multi TB systems. But I have several multi
>> TB servers with FreeBSD and ZFS, and none of them has ever behaved
>> this way. Besides, the panic is instantaneous, not degenerative. I
>> imagine that a run like that would start out OK and then just get
>> worse and worse, getting gradually slower and slower until it just
>> wouldn't cope any more and hang. This feels more like hitting a wall.
>> As if it found something that it couldn't deal with and has no choice
>> but to panic immediately.
>
> It may be possible that you have some underlying filesystem corruption
> that triggers this situation.  Have you actually tried doing a "zpool
> scrub" of your pools and seeing if any errors happen or if the panic
> occurs there?
>
> I'm inclined to think what you're experiencing is probably a bug or
> "quirk" in the storage controller driver you're using.  There are other
> drivers that have had fixes applied to them "to make them work decently
> with ZFS", meaning the kind of stressful I/O ZFS puts on them results in
> the controller driver behaving oddly or freaking out, case in point.  It
> could also be a controller firmware bug/quirk/design issue.  Seriously.
>
> I believe the AOC-USAS2-L8i controller has been discussed on
> freebsd-stable, re: mps(4) driver problems or equivalent, but I'm not
> going to CC that list given that there would be 3 cross-posted lists
> involved and that is liable to upset some folks.  You should search the
> mailing lists for discussion of Supermicro controllers that work
> reliably with FreeBSD.
>
> It would be worthwhile to discuss this condition on -stable, mainly with
> something like "Anyone else using the AOC-USAS2-L8i reliably with ZFS?"
> You get the idea.

There is a steady stream of patches from LSI staff to
both the mptsas and mpt2sas drivers on the Linux SCSI
list (e.g. the most recent patch set to mpt2sas was on
20111019 and contained 7 separate "fixes").

I don't see these patches appearing on this list. Is there
a mechanism to get driver corrections incorporated into
the relevant FreeBSD drivers?

LSI do keep FreeBSD drivers on their site. For example for
the 9212-4i4e HBA, see this page:
   http://www.lsi.com/products/storagecomponents/Pages/LSISAS9212-4i4e.aspx
That FreeBSD zip is dated 20110808 and has mps drivers for
FreeBSD 7.2.0, 7.4.0, 8.2.0

Doug Gilbert



Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Ken Merry
On Wed, Oct 26, 2011 at 10:05:33 -0400, Douglas Gilbert wrote:

> On 11-10-26 06:16 AM, Jeremy Chadwick wrote:
> >On Wed, Oct 26, 2011 at 11:36:44AM +0200, Karli Sjöberg wrote:
> >>Hi all,
> >>
> >>I tracked down what causes the panics!
> >>
> >>I got a tip from aragon and phoenix at the forum about
> >>/etc/periodic/security/100.chksetuid
> >>
> >>And to put:
> >>daily_status_security_chksetuid_enable="NO"
> >>into /etc/periodic.conf
> >
> >This is not truly the cause of the panic, it simply exacerbates it.
> >
> >Many of the periodic scripts will do things like iterate over all files
> >on the filesystem looking for specific attributes, etc..  This tends to
> >stress filesystems heavily.  This isn't the only one.  :-)
> >
> >>I can now run periodic daily without any panics. I'm still wondering
> >>about the cause of this, the explanation from the forum was that that
> >>phase is too demanding for multi TB systems. But I have several multi
> >>TB servers with FreeBSD and ZFS, and none of them has ever behaved
> >>this way. Besides, the panic is instantaneous, not degenerative. I
> >>imagine that a run like that would start out OK and then just get
> >>worse and worse, getting gradually slower and slower until it just
> >>wouldn't cope any more and hang. This feels more like hitting a wall.
> >>As if it found something that it couldn't deal with and has no choice
> >>but to panic immediately.
> >
> >It may be possible that you have some underlying filesystem corruption
> >that triggers this situation.  Have you actually tried doing a "zpool
> >scrub" of your pools and seeing if any errors happen or if the panic
> >occurs there?
> >
> >I'm inclined to think what you're experiencing is probably a bug or
> >"quirk" in the storage controller driver you're using.  There are other
> >drivers that have had fixes applied to them "to make them work decently
> >with ZFS", meaning the kind of stressful I/O ZFS puts on them results in
> >the controller driver behaving oddly or freaking out, case in point.  It
> >could also be a controller firmware bug/quirk/design issue.  Seriously.
> >
> >I believe the AOC-USAS2-L8i controller has been discussed on
> >freebsd-stable, re: mps(4) driver problems or equivalent, but I'm not
> >going to CC that list given that there would be 3 cross-posted lists
> >involved and that is liable to upset some folks.  You should search the
> >mailing lists for discussion of Supermicro controllers that work
> >reliably with FreeBSD.
> >
> >It would be worthwhile to discuss this condition on -stable, mainly with
> >something like "Anyone else using the AOC-USAS2-L8i reliably with ZFS?"
> >You get the idea.
>
> There is a steady stream of patches from LSI staff to
> both the mptsas and mpt2sas drivers on the Linux SCSI
> list (e.g. the most recent patch set to mpt2sas was on
> 20111019 and contained 7 separate "fixes").
>
> I don't see these patches appearing on this list. Is there
> a mechanism to get driver corrections incorporated into
> the relevant FreeBSD drivers?
>
> LSI do keep FreeBSD drivers on their site. For example for
> the 9212-4i4e HBA, see this page:
>   http://www.lsi.com/products/storagecomponents/Pages/LSISAS9212-4i4e.aspx
> That FreeBSD zip is dated 20110808 and has mps drivers for
> FreeBSD 7.2.0, 7.4.0, 8.2.0

They do have a developer working on their version of the mps driver.
Release of the driver has been hung up by LSI's legal department since
February, unfortunately.  I'm not sure what the issue is, but that is why
it isn't in FreeBSD.

The plan, once LSI's legal department approves it, is to hopefully give
their developer commit access so he can just check fixes in to the driver.

For now, though, their binary-only drivers may fix things for some folks.
e.g., those drivers should support their Integrated RAID features.  (The
driver in the tree doesn't support them.)

The error recovery code in their driver is a bit better (the error recovery
part was written by Isilon), but I'm not sure whether it would fix this
particular problem.  This really looks like a ZFS issue.

Ken
--
Kenneth Merry
[hidden email]

Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Karli Sjöberg-2
In reply to this post by Ken Merry
Hi,

I'm not alone!

By complete chance I was reading another thread on the forum and it turns out that peetaur also has the exact same problem as me with timeouts and sometimes losing disks. His hardware is very different from mine, except that we both have LSI controllers and are running 8.2-STABLE. He has tried both the mps-driver in FreeBSD and the mps-driver that LSI provides (phase 11), and still gets these timeouts.

peetaur's system:
4U chassis from Supermicro, 847E16-R1400LPB,
with 2x 1400 W redundant power supplies and 36x hot-swap bays for SAS or SATA

Motherboard from Supermicro
- Intel® 5520 (Tylersburg) Chipset
- 12 DIMM memory slots (max. 192GB DDR3)
- 2x 100/1000Base TX Gigabit Ethernet Port (Dual Intel® 82576 Gigabit Ethernet)
- 6x SATA (3 Gbps) Ports via ICH10R Controller
- PCI Slots: 7x (x8) PCI-E 2.0 (in x16 slots)
- Integrated IPMI 2.0 with Dedicated LAN
- Integrated Matrox G200eW Graphics
CPU
- 2x E5620 Intel Xeon (Westmere) Quad Core CPU, (80W) 2,40 GHz, 12 MB L3 Cache
RAM
- 48 GB (6x 8GB) DDR3 1333 DIMM, REG, ECC
SAS HBA
- 9211-8i
Network
- 10G Card with Dual-port Intel® 82598EB (CX4)
Disks
- 9x HDD 3TB SATA from Hitachi, 7.2k RPM, 64 MB Cache
- 9x HDD 3TB SATA from Seagate, 7.2k RPM, 64 MB Cache
- 2x consumer SSDs (boot, root, zil, cache)

#uname -a
FreeBSD bcnas1.bc.local 8.2-STABLE FreeBSD 8.2-STABLE #0: Thu Sep 29 15:06:03 CEST 2011     [hidden email]:/usr/obj/usr/src/sys/GENERIC  amd64

And an extract from /var/log/messages when using FreeBSD's mps:
Oct  4 08:57:05 bcnas1 kernel: (da3:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 568
Oct  4 08:57:05 bcnas1 kernel: (da3:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 998
Oct  4 08:57:13 bcnas1 kernel: mps0: (0:0:0) terminated ioc 804b scsi 0 state c xfer 0
Oct  4 08:57:13 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 568 complete
Oct  4 08:57:13 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 998
Oct  4 08:57:13 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 998 complete
Oct  4 08:58:13 bcnas1 kernel: (da3:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 973
Oct  4 08:58:13 bcnas1 kernel: (da3:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 981
Oct  4 08:58:21 bcnas1 kernel: mps0: (0:0:0) terminated ioc 804b scsi 0 state c xfer 0
Oct  4 08:58:21 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 973 complete
Oct  4 08:58:21 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 981
Oct  4 08:58:21 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 981 complete
Oct  4 08:58:24 bcnas1 kernel: (da3:mps0:0:0:0): READ(6). CDB: 8 0 0 0 80 0
Oct  4 08:58:24 bcnas1 kernel: (da3:mps0:0:0:0): CAM status: SCSI Status Error
Oct  4 08:58:24 bcnas1 kernel: (da3:mps0:0:0:0): SCSI status: Check Condition
Oct  4 08:58:24 bcnas1 kernel: (da3:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct  4 09:00:14 bcnas1 kernel: mps0: mpssas_remove_complete on target 0x0000, IOCStatus= 0x0
Oct  4 09:00:14 bcnas1 kernel: (da3:mps0:0:0:0): lost device
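
One way to read an extract like this is to pair each "SCSI command timeout" with its later "abort request ... complete" by SMID. The sketch below is only an illustration written against the message wording shown above; the regular expressions and the name `unmatched_timeouts` are assumptions, not part of any FreeBSD tool.

```python
import re

# Patterns follow the wording of the mps(4) messages quoted above.
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>\w+):[\d:]+\): SCSI command timeout "
    r"on device handle 0x[0-9a-f]+ SMID (?P<smid>\d+)"
)
ABORT_RE = re.compile(
    r"(?P<ctrl>\w+): mpssas_abort_complete: abort request "
    r"on handle 0x[0-9a-f]+ SMID (?P<smid>\d+) complete"
)


def unmatched_timeouts(lines):
    """Return (controller, SMID) pairs that timed out but were never aborted."""
    pending = set()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            pending.add((m.group("ctrl"), int(m.group("smid"))))
            continue
        m = ABORT_RE.search(line)
        if m:
            pending.discard((m.group("ctrl"), int(m.group("smid"))))
    return sorted(pending)
```

An empty result means every timed-out command was eventually aborted by the driver, as appears to be the case in the extract above; anything left over would be worth a closer look.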

And an extract from /var/log/messages when using LSI's mps:
Nov  3 09:17:10 bcnas1bak kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff800f629000 cm 0xffffff800f65f698
Nov  3 09:17:10 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 6 2c 0 da 0 0 0 0 0 4f 0 c2 0 b0 0 length 0 SMID 717 command timeout cm 0xffffff800f65f698 ccb 0xffffff0026bbb800
Nov  3 09:17:10 bcnas1bak kernel: mpslsi0: mpssas_alloc_tm freezing simq
Nov  3 09:17:10 bcnas1bak kernel: mpslsi0: timedout cm 0xffffff800f65f698 allocated tm 0xffffff800f6340f8
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 2c f3 be e2 0 0 2a 0 length 21504 SMID 261 completed cm 0xffffff800f643cd8 ccb 0xffffff0026bd1000 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 2c f3 be e2 0 0 2a 0 length 21504 SMID 261 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 52 1e 2 e3 0 0 2b 0 length 22016 SMID 534 completed cm 0xffffff800f654550 ccb 0xffffff0026b96000 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 52 1e 2 e3 0 0 2b 0 length 22016 SMID 534 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 3a 5 14 a3 0 0 2b 0 length 22016 SMID 798 completed cm 0xffffff800f664510 ccb 0xffffff003d438000 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 3a 5 14 a3 0 0 2b 0 length 22016 SMID 798 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 39 81 86 6f 0 0 2b 0 length 22016 SMID 590 completed cm 0xffffff800f657b90 ccb 0xffffff00314ce800 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 39 81 86 6f 0 0 2b 0 length 22016 SMID 590 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 39 47 e8 2c 0 0 2a 0 length 21504 SMID 634 completed cm 0xffffff800f65a630 ccb 0xffffff0026ba1800 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 39 47 e8 2c 0 0 2a 0 length 21504 SMID 634 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 2d 8b 96 af 0 0 2b 0 length 22016 SMID 707 completed cm 0xffffff800f65ece8 ccb 0xffffff0026bb1800 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 2d 8b 96 af 0 0 2b 0 length 22016 SMID 707 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 6 2c 0 da 0 0 0 0 0 4f 0 c2 0 b0 0 length 0 SMID 717 completed timedout cm 0xffffff800f65f698 ccb 0xffffff0026bbb800 during recov(da0:mpslsi0:0:10:0): R$
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 1c dc 68 73 0 0 2b 0 length 22016 SMID 690 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 58 d da 33 0 0 2b 0 length 22016 SMID 947 completed cm 0xffffff800f66d568 ccb 0xffffff0026bf9000 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 58 d da 33 0 0 2b 0 length 22016 SMID 947 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 4b 30 d1 80 0 0 2a 0 length 21504 SMID 683 completed cm 0xffffff800f65d5a8 ccb 0xffffff003d47f800 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 4b 30 d1 80 0 0 2a 0 length 21504 SMID 683 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 4a d 10 d0 0 0 2b 0 length 22016 SMID 219 completed cm 0xffffff800f641428 ccb 0xffffff0031536000 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 4a d 10 d0 0 0 2b 0 length 22016 SMID 219 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 41 1e 9a 58 0 0 2a 0 length 21504 SMID 169 completed cm 0xffffff800f63e3b8 ccb 0xffffff00314ec800 during recovery ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 41 1e 9a 58 0 0 2a 0 length 21504 SMID 169 terminated ioc 804b scsi 0 state c xfer 0
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d0 0 1 0 0 0 4f 0 c2 0 b0 0 length 512 SMID 139 completed cm 0xffffff800f63c6a8 ccb 0xffffff0026a89000 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 6 2c 0 da 0 0 0 0 0 4f 0 c2 0 b0 0 length 0 SMID 876 completed cm 0xffffff800f6690a0 ccb 0xffffff00314c8800 during recovery ioc 8(pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d5 0 1 0 6 0 4f 0 c2 0 b0 0 length 512 SMID 661 completed cm 0xffffff800f65c058 ccb 0xffffff0026b7d000 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d5 0 1 0 6 0 4f 0 c2 0 b0 0 length 512 SMID 471 completed cm 0xffffff800f650848 ccb 0xffffff0026be7800 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d0 0 1 0 0 0 4f 0 c2 0 b0 0 length 512 SMID 215 completed cm 0xffffff800f641048 ccb 0xffffff0026bef800 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d5 0 1 0 6 0 4f 0 c2 0 b0 0 length 512 SMID 203 completed cm 0xffffff800f6404a8 ccb 0xffffff0026bb6000 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d0 0 1 0 0 0 4f 0 c2 0 b0 0 length 512 SMID 546 completed cm 0xffffff800f6550f0 ccb 0xffffff003d447800 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (pass0:mpslsi0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 8 e 0 d5 0 1 0 6 0 4f 0 c2 0 b0 0 length 512 SMID 513 completed cm 0xffffff800f6530f8 ccb 0xffffff0026bcb800 during recovery ioc (pass0:mpslsi0:0:10:0):$
Nov  3 09:17:11 bcnas1bak kernel: (noperiph:mpslsi0:0:10:0): SMID 1 abort TaskMID 717 status 0x0 code 0x0 count 20
Nov  3 09:17:11 bcnas1bak kernel: (noperiph:mpslsi0:0:10:0): SMID 1 finished recovery after aborting TaskMID 717
Nov  3 09:17:11 bcnas1bak kernel: mpslsi0: mpssas_free_tm releasing simq
Nov  3 09:17:17 bcnas1bak kernel: (da0:mpslsi0:0:10:0): READ(10). CDB: 28 0 41 1e 9a 58 0 0 2a 0
Nov  3 09:17:17 bcnas1bak kernel: (da0:mpslsi0:0:10:0): CAM status: SCSI Status Error
Nov  3 09:17:17 bcnas1bak kernel: (da0:mpslsi0:0:10:0): SCSI status: Check Condition
Nov  3 09:17:17 bcnas1bak kernel: (da0:mpslsi0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
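
For anyone wanting to see which blocks those READ(10) commands were touching, the CDB bytes can be decoded by hand. The following Python sketch assumes the standard 10-byte READ(10)/WRITE(10) layout (start LBA in bytes 2-5, block count in bytes 7-8, both big-endian) and a 512-byte block size; `decode_rw10` is a hypothetical helper name.

```python
def decode_rw10(cdb_text, block_size=512):
    """Decode a READ(10)/WRITE(10) CDB printed as space-separated hex bytes.

    block_size=512 is an assumption; it happens to match the "length"
    values printed in the log lines above.
    """
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: start LBA
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: block count
    return {
        "op": "READ(10)" if cdb[0] == 0x28 else "WRITE(10)",
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }
```

For the first READ(10) above, `decode_rw10("28 0 2c f3 be e2 0 0 2a 0")` gives 42 blocks, i.e. 21504 bytes, which agrees with the "length 21504" the driver printed.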


/Karli

On 25 Oct 2011, at 21:33, Kenneth D. Merry wrote:

On Thu, Oct 20, 2011 at 13:28:17 +0200, Karli Sjöberg wrote:
Hi,

I'm in the process of vacating a Sun/Oracle system to another Supermicro/FreeBSD system, doing zfs send/recv between them. Twice now, the system has panicked while not doing anything at all, and it's throwing a lot of SCSI/CAM-related errors during I/O-intensive operations like send/recv and resilver, and zpool has sometimes reported read/write errors on the hard drives. Best part is that the errors in messages are about all hard drives at one time or another, and they are connected with separate cables, controllers and caddies. Specs:

HW:
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3,2GHz
4x  2GB 1333MHZ DDR3 ECC UDIMM
10x SAMSUNG HD204UI (in a raidz2 zpool)
1x  OCZ Vertex 3 240GB (L2ARC)

SW:
# uname -a
FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
# zpool get version pool1
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default

I got the panic from the IPMI KVM:
http://i55.tinypic.com/synpzk.png

In looking at the panic, this is a ZFS panic.  Nothing the disks do should
be able to cause ZFS to panic.  ZFS is panicking in avl_add():

/*
 * This is unfortunate.  We want to call panic() here, even for
 * non-DEBUG kernels.  In userland, however, we can't depend on anything
 * in libc or else the rtld build process gets confused.  So, all we can
 * do in userland is resort to a normal ASSERT().
 */
if (avl_find(tree, new_node, &where) != NULL)
#ifdef _KERNEL
        panic("avl_find() succeeded inside avl_add()");
#else
        ASSERT(0);
#endif
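
The check quoted above fires when the node being added is already in the tree. As a toy illustration of that invariant only (this is not the ZFS code, and the class `UniqueSortedSet` is invented for the example), the same rule can be sketched in Python with a sorted list standing in for the AVL tree:

```python
import bisect


class UniqueSortedSet:
    """Toy stand-in for an AVL tree: ordered, no duplicate keys allowed."""

    def __init__(self):
        self._keys = []

    def find(self, key):
        """Return the index of key, or None if it is absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        return None

    def add(self, key):
        """Insert key; like avl_add(), treat an existing key as a fatal bug."""
        if self.find(key) is not None:
            raise RuntimeError("find() succeeded inside add()")
        bisect.insort(self._keys, key)
```

The point of the sketch is that the failure is a caller-side logic error (adding something twice), not an I/O error, which is consistent with the view that disk misbehavior alone should not be able to trigger it.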

There are certainly timeouts and two terminated IOCs in the log below.  That
does suggest a hardware or driver problem, but it isn't very obvious what
it might be.

I have seen bad behavior with SATA drives behind 3Gb Maxim expanders
talking to 6Gb LSI controllers, but your particular configuration does not
involve any expanders, and therefore is not that particular STP issue.

My best guess, and it is a guess, is that either the drives are misbehaving
(i.e. firmware type problem) or you've got a cabling issue.

If you have more hardware available, you might try swapping out the cables
and/or drives to see if you can reproduce the drive errors with a
different setup.  If you swap the drives, I would use a different brand if
you've got them available.

I'm CCing the fs list, perhaps someone there can look at the stack trace
above and figure out what ZFS might be doing.

Again, ZFS should survive any errors from the drives, and the panic above
looks like ZFS is flagging a logic bug somewhere.


And an extract from /var/log/messages:
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(10). CDB: 2a 0 6 13 66 f 0 0 f 0
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(6). CDB: a 0 1 b2 2 0
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 859
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 495
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 725
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 722
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: (1:4:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:40:38 fs2-7 last message repeated 3 times
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 859 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 495
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 495 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 725
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 725 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 722
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 722 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 438 complete
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 6 25 4f 75 0 0 b 0
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 2d a5 10 ca 0 0 80 0
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:45:40 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 636
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 888
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 983
Oct 19 17:45:41 fs2-7 kernel: mps0: (0:1:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:45:41 fs2-7 last message repeated 2 times
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 976 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 636
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 636 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 888
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 888 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 983
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 983 complete
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 a7 2 0 0 3 0
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 b0 9 0 0 9 0
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
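
To check whether the timeouts cluster on one disk or controller, the log can be tallied per (controller, device) pair. This is a small Python sketch under the assumption that the messages keep the wording shown above; `timeouts_by_device` is a hypothetical name.

```python
import re
from collections import Counter

# Matches e.g. "(da9:mps1:0:4:0): SCSI command timeout ..."
EVENT_RE = re.compile(
    r"\((?P<dev>da\d+):(?P<ctrl>mps\d+):[\d:]+\): SCSI command timeout"
)


def timeouts_by_device(lines):
    """Count SCSI command timeouts per (controller, device) pair."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[(m.group("ctrl"), m.group("dev"))] += 1
    return counts
```

In the extract above the timeouts land on da9 behind mps1 and on da1 behind mps0, so they are not confined to a single controller.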

What's going on?

Regards
Karli Sjöberg

Ken
--
Kenneth Merry
[hidden email]



Kind regards
-------------------------------------------------------------------------------
Karli Sjöberg
Swedish University of Agricultural Sciences
Box 7079 (Visiting Address Kronåsvägen 8)
S-750 07 Uppsala, Sweden
Phone:  +46-(0)18-67 15 66
[hidden email]


Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Karli Sjöberg-2
In reply to this post by Jeremy Chadwick
As a test, I have copied in about 1.5TB and scrubbed several times without any panic. It stayed solid until periodic weekly:( Same panic as with daily.

/Karli Sjöberg

On 26 Oct 2011, at 12:16, Jeremy Chadwick wrote:

On Wed, Oct 26, 2011 at 11:36:44AM +0200, Karli Sjöberg wrote:
Hi all,

I tracked down what causes the panics!

I got a tip from aragon and phoenix at the forum about
/etc/periodic/security/100.chksetuid

And to put:
daily_status_security_chksetuid_enable="NO"
into /etc/periodic.conf

This is not truly the cause of the panic, it simply exacerbates it.

Many of the periodic scripts will do things like iterate over all files
on the filesystem looking for specific attributes, etc..  This tends to
stress filesystems heavily.  This isn't the only one.  :-)

I can now run periodic daily without any panics. I'm still wondering
about the cause of this, the explanation from the forum was that that
phase is too demanding for multi TB systems. But I have several multi
TB servers with FreeBSD and ZFS, and none of them has ever behaved
this way. Besides, the panic is instantaneous, not degenerative. I
imagine that a run like that would start out OK and then just get
worse and worse, getting gradually slower and slower until it just
wouldn't cope any more and hang. This feels more like hitting a wall.
As if it found something that it couldn't deal with and has no choice
but to panic immediately.

It may be possible that you have some underlying filesystem corruption
that triggers this situation.  Have you actually tried doing a "zpool
scrub" of your pools and seeing if any errors happen or if the panic
occurs there?

I'm inclined to think what you're experiencing is probably a bug or
"quirk" in the storage controller driver you're using.  There are other
drivers that have had fixes applied to them "to make them work decently
with ZFS", meaning the kind of stressful I/O ZFS puts on them results in
the controller driver behaving oddly or freaking out, case in point.  It
could also be a controller firmware bug/quirk/design issue.  Seriously.

I believe the AOC-USAS2-L8i controller has been discussed on
freebsd-stable, re: mps(4) driver problems or equivalent, but I'm not
going to CC that list given that there would be 3 cross-posted lists
involved and that is liable to upset some folks.  You should search the
mailing lists for discussion of Supermicro controllers that work
reliably with FreeBSD.

It would be worthwhile to discuss this condition on -stable, mainly with
something like "Anyone else using the AOC-USAS2-L8i reliably with ZFS?"
You get the idea.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |




Kind regards
-------------------------------------------------------------------------------
Karli Sjöberg
Swedish University of Agricultural Sciences
Box 7079 (Visiting Address Kronåsvägen 8)
S-750 07 Uppsala, Sweden
Phone:  +46-(0)18-67 15 66
[hidden email]


Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Rich-3
Observation: in my experience, LSI SAS expanders sometimes misbehave
when a drive takes longer than some timeout to respond to commands.
(As far as I've seen this only happens with SATA drives, but I don't
have many SAS drives for comparison.) Further commands to that drive
then fail for a while, and what happens next varies dramatically
depending on the OS.

If you could try without an expander (e.g. with 1->4 SAS->SATA fanout
cables), you may be surprised (and/or annoyed) to find your life gets
better.

- Rich


Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Douglas Gilbert-2
On 11-11-07 03:56 AM, Rich wrote:

> Observation - the LSI SAS expanders, in my experience, sometimes
> misbehave when there are drives which respond slower than some timeout
> to commands (as far as I've seen it's only SATA drives it does this
> for, but I don't have many SAS drives for comparison), leading to all
> further commands to that drive for a bit not working, and then what
> happens depending on the OS varies dramatically.
>
> If you could try without an expander (e.g. with 1->4 SAS->SATA fanout
> cables), you may be surprised (and/or annoyed) to find your life gets
> better.

SAS-2 expanders are better than the original generation.
[LSI makes both.] SAS-2 added the CONFIGURE GENERAL SMP
function which contains various timeout tweaks for the
STP protocol (i.e. the protocol that tunnels (S)ATA
commands between a SAS HBA (initiator) and an expander).

If you are using SAS-2 expanders and FreeBSD 9.0 then you
can fetch my smp_utils package and use the smp_conf_general
utility to change those timeout settings. If you have SAS-2
expanders but an older version of FreeBSD then you will
need Solaris or Linux to run my smp_utils package in order
to change those timeout values on the expander.

Doug Gilbert

BTW smp_rep_general will show the current settings of those
STP timeouts.
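
As a rough sketch of that workflow (not taken from the message above): the wrapper names, the /dev/ses0 device path and the value 5000 are placeholders, and the `--inactivity` option and its units should be checked against the smp_conf_general(8) man page shipped with your smp_utils version.

```sh
#!/bin/sh
# Hypothetical wrappers around smp_utils; device path and values are placeholders.

# Show the expander's REPORT GENERAL response, which includes the STP timeouts.
show_stp_timeouts() {
    smp_rep_general "$1"
}

# Set the STP bus inactivity time limit via CONFIGURE GENERAL.
# With DRY_RUN set, only print the command instead of running it.
set_stp_inactivity() {
    dev="$1"
    limit="$2"
    ${DRY_RUN:+echo} smp_conf_general --inactivity="$limit" "$dev"
}
```

Calling `show_stp_timeouts /dev/ses0` before and after a change gives a record of what was altered, and setting DRY_RUN first lets you review the exact command before it touches the expander.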
