problems with SAS JBODs 2

Oliver Sech
Hi!
 
I use FreeBSD for a large ZFS pool (over 1 PB), and I recently ran into a lot of problems with the JBODs. Generally everything works fine until I replug the shelves.
 
When I start with a clean system and attach a single shelf, everything seems fine.
-> 44 disks show up, I can use the enclosure services (sesutil), and the system continues to run without problems.
Once I disconnect the SAS cable, wait until all devices disappear, and reconnect it, I get all sorts of problems.
-> A random number of disks show up, and the enclosure ("ses") devices do not show up.
Once I restart the system, I can start over again.
 
On the server with the large pool, there are only certain ports on the HBA that I can use; otherwise disks will be missing after a reboot and my ZFS pool won't come online.
I tried different firmware on the HBA. I also tried the mpr.ko module from the Broadcom site. (I replaced the one in /boot/kernel?)
I ran all of the tests above with Linux as the OS, and everything seems to work there.
 
 
Is there anything I'm missing? A command that can reset the SAS components?
 
 
FreeBSD version: 11.1-RELEASE-p11
HBA: Broadcom LSI 9305-16e (latest firmware)
JBOD: SC847E2C-R1K28JBOD (two expanders, internally daisy-chained)
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
To unsubscribe, send any mail to "[hidden email]"

Re: problems with SAS JBODs 2

Ben RUBSON
On 03 Jul 2018 14:28, Oliver Sech wrote:

> Once I disconnect the SAS cable, wait until all devices disapear and  
> reconnect I get all sorts of problems.
> -> a random number of disks shows up and the enclosure "ses" do not show up
> Once I restart the system I can start over again.

Hi,

I faced the same sort of issue, but with iSCSI disks.
The disks did not disconnect properly and did not reconnect until a
reboot was performed.
Among the needed iSCSI patches, a GEOM one has been committed:
https://github.com/freebsd/freebsd/commit/ea40366602be7548eba0bec35fec46ea4509dbb7
(it's in 11.2, but not in your 11.1).
Perhaps this could help.

Ben


Re: problems with SAS JBODs 2

Alan Somers-2
In reply to this post by Oliver Sech

1) Are the expanders daisy-chained?  Some SAS expanders don't work reliably
when daisy-chained.  Best to direct-connect each one to the server.
2) Are the expanders connected in multipath or single path?  You need
geom_multipath if you're going to use multipath.
3) Are you attempting to use wide ports (two SAS cables connecting each
expander to the HBA)?  If so, you'll need to make sure that each pair of
SAS cables goes to the same HBA chip (not merely the same card, as some
cards contain two HBA chips).
4) Are you trying to remove an expander while ZFS is active on that
expander?  That will suspend your pool, and ZFS doesn't always recover from
a suspended state.

-Alan

Re: problems with SAS JBODs 2

Ken Merry
In reply to this post by Oliver Sech

Steve McConnell (CCed) and I have been corresponding with someone else who
has a problem very similar to yours.

The most likely issue is that the mapping table stored on the card is messed
up.  Can you send dmesg output with the following loader tunable set:

hw.mpr.debug_level=0x203

That will turn on debugging for the mapping code and may show the problem.
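To see which message classes a value such as 0x203 selects, the bitmask can be decoded with a short sketch like the one below. The bit descriptions are taken from memory of the mpr(4) manual page and should be treated as an assumption; check them against the manual page for your release before relying on them.

```python
# Assumed bit meanings for hw.mpr.debug_level (recalled from mpr(4);
# verify against the manual page on your system).
MPR_DEBUG_BITS = {
    0x0001: "informational prints",
    0x0002: "driver faults",
    0x0004: "controller events",
    0x0008: "controller logging",
    0x0010: "recovery operations",
    0x0020: "parameter errors and programming bugs",
    0x0040: "system initialization",
    0x0080: "more detailed information",
    0x0100: "user-generated commands (IOCTL)",
    0x0200: "device mapping",
    0x0400: "tracing through driver functions",
}


def decode_debug_level(level):
    """Return the descriptions of all bits set in `level`, lowest bit first."""
    return [name for bit, name in sorted(MPR_DEBUG_BITS.items()) if level & bit]


print(decode_debug_level(0x203))
```

Under that assumption, 0x203 keeps the two lowest bits and adds the device-mapping bit, which matches the stated purpose of the tunable.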

If you see messages like this:

mpr0: Attempting to reuse target id 63 handle 0x000b
mpr0: Attempting to reuse target id 64 handle 0x000c
mpr0: Attempting to reuse target id 65 handle 0x000d
mpr0: Attempting to reuse target id 66 handle 0x000e
mpr0: Attempting to reuse target id 67 handle 0x000f
mpr0: Attempting to reuse target id 68 handle 0x0010
mpr0: Attempting to reuse target id 69 handle 0x0011
mpr0: Attempting to reuse target id 70 handle 0x0012
mpr0: Attempting to reuse target id 66 handle 0x000e

It indicates that the mapping code is preventing some of the drives from
fully probing because there are collisions in the table.
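A saved dmesg can be scanned for these messages with a small script. The following is only a sketch: the regular expression is modeled on the sample lines quoted above, and the function name `find_reused_targets` is made up for illustration. It reports target IDs that appear in more than one "reuse" message, which is one rough way to spot candidates for a collision.

```python
import re
from collections import Counter

# Pattern modeled on the sample log lines quoted in this message.
REUSE_RE = re.compile(
    r"^(mpr\d+): Attempting to reuse target id (\d+) handle (0x[0-9a-fA-F]+)"
)


def find_reused_targets(dmesg_text):
    """Map (controller, target id) to its count, for IDs seen more than once."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = REUSE_RE.match(line.strip())
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return {key: n for key, n in counts.items() if n > 1}


# The sample lines from this message, used as test input.
SAMPLE = """\
mpr0: Attempting to reuse target id 63 handle 0x000b
mpr0: Attempting to reuse target id 64 handle 0x000c
mpr0: Attempting to reuse target id 65 handle 0x000d
mpr0: Attempting to reuse target id 66 handle 0x000e
mpr0: Attempting to reuse target id 67 handle 0x000f
mpr0: Attempting to reuse target id 68 handle 0x0010
mpr0: Attempting to reuse target id 69 handle 0x0011
mpr0: Attempting to reuse target id 70 handle 0x0012
mpr0: Attempting to reuse target id 66 handle 0x000e
"""

print(find_reused_targets(SAMPLE))
```

In the sample, only target ID 66 is reported twice. Whether a repeated ID really corresponds to a mapping-table collision still has to be judged from the full log.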

Unfortunately we have not yet fixed the problem in the other situation.
(He is running with multipathing, which could be contributing to the
problem.)

I have a script and utility that will clear the mapping table in the card,
but that hasn't been enough to fix the other situation.  If you do have a
mapping problem, I can give you the script/utility to clear the table and
we can see whether it fixes your problem.

If not, it'll probably have to wait until Steve gets back from vacation.

Ken
--
Kenneth Merry
[hidden email]

Re: problems with SAS JBODs 2

Oliver Sech
In reply to this post by Alan Somers-2
> 1) Are the expanders daisy chained?  Some SAS expanders don't work reliably
> when daisy chained.   Best to direct connect each one to the server.
At the moment I have 1 JBOD connected to 1 HBA port with 1 cable (4 lanes?).
Unfortunately the JBOD has 24 slots in the front and 20 in the back, and those are connected via an internal SAS daisy chain.
I could rewire and connect each backplane directly to the server, but unfortunately I do not have enough ports.

JBOD model: Supermicro 847E2C-R1K28JBOD

> 2) Are the expanders connected in multipath or single path?  You need
> geom_multipath if you're going to do that.
See answer 1. There is a single path from the host to the first expander.

> 3) Are you attempting to use wide ports (two SAS cables connecting each
> expander to the HBA).  If do, you'll need to make sure that each pair of
> SAS cables goes to the same HBA chip (not merely the same card, as some
> cards contain two HBA chips).
See 1. The last time I opened one of those JBODs, there were 8 SAS cables between the front and back expanders. I assume that wide ports are being used.
(There are 2 expanders per backplane as well.)

> 4) Are you trying to remove an expander while ZFS is active on that
> expander?  That will suspend your pool, and ZFS doesn't always recover from
> a suspended state.
I'm testing with a new unused disk shelf that was never part of the ZFS pool. There were

Re: problems with SAS JBODs 2

Oliver Sech
In reply to this post by Ken Merry

I added the "hw.mpr.debug_level" tunable and collected logs on the whole connect -> disconnect -> connect problem.

logs collected:
first connect log: https://paste.docker.ist.ac.at/?6ec80dde0e1f236f#NufbXSs6o+dTDTPgZgWbU8vRQ6B47tMbQ8LHPkMXfIg=
first connect sesutil: https://paste.docker.ist.ac.at/?256810338f87adc1#/N3m6iFH304SxSxpnHCt0ocOeAU8zkBennul2/BcKpQ=

disconnected shelf log: https://paste.docker.ist.ac.at/?07ff1129a6cb6117#8WH8AjO1sO2hZlHE39h314CoQxxFZmBVZNo+Q8+qp4Q=
disconnected shelf mprutil: https://paste.docker.ist.ac.at/?eebaee72dc9e1cfe#WTlnO5vlPb7997lJCMswWfwtcq1rN04CaFbxmMWHqrU=

second connect log: https://paste.docker.ist.ac.at/?684ff32c6dae185b#nZ32x023ApRvNKrVUhvCr7xi5cYJnPhs9XNTfEW6sMw=
second connect sesutil: https://paste.docker.ist.ac.at/?f0302ce3aa8e55d7#+ZaJsCUiLh/7VsqBJ5oPHxZtRbM1dVS2RankrXePikw=
second connect mprutil: https://paste.docker.ist.ac.at/?4b8d347aed941c1f#wX7y0cjtb2gYKLU99IIftmDcFpKiV2QqjcC7YN96nB0=


If you are interested in investigating this further, I can try to organize a "test environment", as I'm pretty sure this issue is not limited to my hardware.

best regards,
Oliver

Re: problems with SAS JBODs 2

Oliver Sech
In reply to this post by Oliver Sech
I tested a few additional things. I don't think this is a multipath, daisy-chain, or SAS wide-port problem.
I can reproduce the problem with just a single connection to an expander/JBOD.

Test:
* physically disconnect all shelves
* reboot system
* connect one shelf via SAS cable
* check number of disks (after a reboot everything always shows up)
* disconnect the shelf and wait (geom disk list still shows most disks.)
* connect the shelf (missing disks)
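The disk check in the steps above can be made repeatable by saving the output of `geom disk list` before the disconnect and after the reconnect, then comparing the two. The sketch below assumes the output contains one `Geom name: <disk>` line per disk; the function names and the sample text are made up for illustration.

```python
import re

# Assumes each disk appears as a "Geom name: <name>" line in `geom disk list`.
GEOM_NAME_RE = re.compile(r"^Geom name:\s*(\S+)", re.MULTILINE)


def disk_names(geom_output):
    """Return the set of disk names found in saved `geom disk list` output."""
    return set(GEOM_NAME_RE.findall(geom_output))


def compare_disks(before, after):
    """List disks that vanished and disks that newly appeared."""
    old, new = disk_names(before), disk_names(after)
    return {"missing": sorted(old - new), "new": sorted(new - old)}


# Hypothetical, heavily shortened sample output.
BEFORE = "Geom name: da0\nProviders:\n\nGeom name: da1\n\nGeom name: da2\n"
AFTER = "Geom name: da0\nProviders:\n\nGeom name: da2\n"

print(compare_disks(BEFORE, AFTER))
```

Names are sorted as plain strings, so `da10` sorts before `da2`; that is enough for a quick diff, but not for a tidy report.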

Tested Hardware:
* Supermicro SAS3 847E2C-R1K28JBOD     + SAS3 LSI 9305-16e (internal daisy chain + wide links)
* Supermicro SAS3 847E2C-R1K28JBOD     + SAS3 LSI 9305-16e (straight HBA <-> expander connection; no wide links, no daisy chain)
* Supermicro SAS2 SC847E26-RJBOD1      + SAS3 LSI 9305-16e (internal daisy chain)
* Promise    SAS2 VTrak 830            + SAS3 LSI 9305-16e (straight HBA <-> expander connection)




RE: problems with SAS JBODs 2

freebsd-scsi mailing list
Hi Oliver, I can't get to your links. Can you try to send the logs in
another way?

Steve


RE: problems with SAS JBODs 2

freebsd-scsi mailing list
In reply to this post by Oliver Sech
Ken, I looked at the logs and I don't see anything in them that suggests
that the driver is not adding any of the devices. In fact, I don't see
anything that looks strange at all. This looks like a different problem than
the other one you mentioned. What do you think?

Steve


Re: problems with SAS JBODs 2

Oliver Sech
Sorry for sending dead links earlier.
(Here is a link for the previous files: https://www.dropbox.com/s/5dlwizrzy48vme3/freebsd_sas.zip?dl=0 )
Here is the link for the new logs: https://www.dropbox.com/s/7bbt1fipg2a50oq/freebsd_sas2.zip?dl=0

Notes:
Logfile: "1_clean_boot_without_shelves_dmesg"

While booting with no shelves attached, the driver actually resets something:
mpr0: mpr_mapping_check_devices: Enclosure XX is missing from the topology. Update its missing count.
mpr0: _mapping_commit_enc_entry: Writing DPM entry XX for enclosure.

Logfile: "3_shelf_disconnected_geom"
The only disks that really are connected are ada0, ada1, and da0.
Everything else cannot be accessed.

Hardware:
Promise    SAS2 VTrak 830 (Full of SATA disks) +  LSI 9305-16e

Oliver


Re: problems with SAS JBODs 2

Ken Merry
In reply to this post by freebsd-scsi mailing list
Yes, I agree, Oliver’s problem looks different.

Oliver, for your second set of files (freebsd_sas2.zip) it looks like you may have devices that aren’t completely going away, even from a SAS standpoint.

Here are the 25 target IDs that show up in 2_shelf_connected_dmesg.txt:

mpr0: mprsas_add_device: Target ID for added device is 467.
mpr0: mprsas_add_device: Target ID for added device is 468.
mpr0: mprsas_add_device: Target ID for added device is 469.
mpr0: mprsas_add_device: Target ID for added device is 470.
mpr0: mprsas_add_device: Target ID for added device is 471.
mpr0: mprsas_add_device: Target ID for added device is 472.
mpr0: mprsas_add_device: Target ID for added device is 473.
mpr0: mprsas_add_device: Target ID for added device is 474.
mpr0: mprsas_add_device: Target ID for added device is 475.
mpr0: mprsas_add_device: Target ID for added device is 476.
mpr0: mprsas_add_device: Target ID for added device is 477.
mpr0: mprsas_add_device: Target ID for added device is 478.
mpr0: mprsas_add_device: Target ID for added device is 479.
mpr0: mprsas_add_device: Target ID for added device is 480.
mpr0: mprsas_add_device: Target ID for added device is 481.
mpr0: mprsas_add_device: Target ID for added device is 482.
mpr0: mprsas_add_device: Target ID for added device is 483.
mpr0: mprsas_add_device: Target ID for added device is 484.
mpr0: mprsas_add_device: Target ID for added device is 485.
mpr0: mprsas_add_device: Target ID for added device is 486.
mpr0: mprsas_add_device: Target ID for added device is 487.
mpr0: mprsas_add_device: Target ID for added device is 488.
mpr0: mprsas_add_device: Target ID for added device is 489.
mpr0: mprsas_add_device: Target ID for added device is 490.
mpr0: mprsas_add_device: Target ID for added device is 503.

Here are the 8 target IDs that disappear in 3_shelf_disconnected_dmesg.txt:

mpr0: mprsas_prepare_remove: Sending reset for target ID 467
mpr0: mprsas_prepare_remove: Sending reset for target ID 468
mpr0: mprsas_prepare_remove: Sending reset for target ID 469
mpr0: mprsas_prepare_remove: Sending reset for target ID 470
mpr0: mprsas_prepare_remove: Sending reset for target ID 471
mpr0: mprsas_prepare_remove: Sending reset for target ID 472
mpr0: mprsas_prepare_remove: Sending reset for target ID 473
mpr0: mprsas_prepare_remove: Sending reset for target ID 474

And here are the same 8 target IDs getting added in 4_shelf_reconnected_dmesg.txt:

mpr0: mprsas_add_device: Target ID for added device is 467.
mpr0: mprsas_add_device: Target ID for added device is 468.
mpr0: mprsas_add_device: Target ID for added device is 469.
mpr0: mprsas_add_device: Target ID for added device is 470.
mpr0: mprsas_add_device: Target ID for added device is 471.
mpr0: mprsas_add_device: Target ID for added device is 472.
mpr0: mprsas_add_device: Target ID for added device is 473.
mpr0: mprsas_add_device: Target ID for added device is 474.
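The comparison above can be reproduced mechanically from the dmesg files. The sketch below assumes the two message formats shown in the excerpts; the function names are made up for illustration, and the sample input is regenerated from the target IDs listed above rather than read from the real log files.

```python
import re

# Patterns modeled on the log lines quoted in this message.
ADD_RE = re.compile(r"mprsas_add_device: Target ID for added device is (\d+)")
REMOVE_RE = re.compile(r"mprsas_prepare_remove: Sending reset for target ID (\d+)")


def target_ids(pattern, text):
    """Collect the target IDs matched by `pattern` in a log."""
    return {int(value) for value in pattern.findall(text)}


def stale_targets(connect_log, disconnect_log):
    """Target IDs added on connect that were never reset on disconnect."""
    added = target_ids(ADD_RE, connect_log)
    removed = target_ids(REMOVE_RE, disconnect_log)
    return sorted(added - removed)


# Rebuild the quoted excerpts: 467-490 plus 503 added, 467-474 removed.
added_ids = list(range(467, 491)) + [503]
removed_ids = list(range(467, 475))
connect_log = "\n".join(
    f"mpr0: mprsas_add_device: Target ID for added device is {i}." for i in added_ids
)
disconnect_log = "\n".join(
    f"mpr0: mprsas_prepare_remove: Sending reset for target ID {i}" for i in removed_ids
)

stale = stale_targets(connect_log, disconnect_log)
print(len(stale), stale)  # 17 IDs: 475 through 490, plus 503
```

With the IDs from the excerpts, 17 of the 25 targets never receive a reset after the cable is pulled.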

Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable?  Does that cause the devices to go away?

Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.

You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’).  That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
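To find the bus numbers for the per-bus rescan, the `camcontrol devlist -v` output can be parsed for the bus header lines. This is a sketch that assumes header lines of the form `scbusN on <sim> bus M:`; the sample text, the device strings in it, and the function name are hypothetical. It only builds the command lines and does not run them.

```python
import re

# Assumes bus header lines such as "scbus8 on mpr0 bus 0:".
BUS_RE = re.compile(r"^scbus(\d+) on (\S+) bus \d+", re.MULTILINE)


def rescan_commands(devlist_output, driver_prefix="mpr"):
    """Build one `camcontrol rescan N` command per bus owned by the driver."""
    commands = []
    for bus, sim in BUS_RE.findall(devlist_output):
        if sim.startswith(driver_prefix):
            commands.append(["camcontrol", "rescan", bus])
    return commands


# Hypothetical sample of `camcontrol devlist -v` output.
SAMPLE_DEVLIST = """\
scbus0 on ahcich0 bus 0:
<EXAMPLE SATA DISK 1.0>  at scbus0 target 0 lun 0 (ada0,pass0)
scbus8 on mpr0 bus 0:
<EXAMPLE SAS DISK 1.0>   at scbus8 target 467 lun 0 (da0,pass1)
"""

for command in rescan_commands(SAMPLE_DEVLIST):
    print(" ".join(command))
```

Each resulting list could be passed to `subprocess.run` on the FreeBSD host; printing the commands first makes it easy to review them before anything is rescanned.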

It seems like the adapter may not be recognizing that the devices in question have gone.

Steve, do you have any ideas what could be going on?

Ken

Ken Merry
[hidden email]



> On Jul 10, 2018, at 11:48 AM, Stephen Mcconnell via freebsd-scsi <[hidden email]> wrote:
>
> Ken, I looked at the logs and I don't see anything in them that suggests
> that the driver is not adding any of the devices. In fact, I don't see
> anything that looks strange at all. This looks like a different problem than
> the other one you mentioned. What do you think?
>
> Steve
>
>> -----Original Message-----
>> From: Stephen Mcconnell [mailto:[hidden email]]
>> Sent: Tuesday, July 10, 2018 9:28 AM
>> To: 'Oliver Sech'; 'FreeBSD-scsi'
>> Subject: RE: problems with SAS JBODs 2
>>
>> Hi Oliver, I can't get to your links. Can you try to send the logs in
>> another way?
>>
>> Steve
>>
>>> -----Original Message-----
>>> From: [hidden email] [mailto:owner-freebsd-
>>> [hidden email]] On Behalf Of Oliver Sech
>>> Sent: Tuesday, July 10, 2018 9:14 AM
>>> To: FreeBSD-scsi
>>> Subject: Re: problems with SAS JBODs 2
>>>
>>> I tested a few additional things. I don't think this is a multipath,
>>> daisy chain nor a SAS wide ports problem.
>>> I can reproduce the problem with just a single connection to an
>>> Expander/JBOD.
>>>
>>> Test:
>>> * physically disconnect all shelves
>>> * reboot system
>>> * connect one shelf via SAS cable
>>> * check number of disks (after a reboot everything always shows up)
>>> * disconnect the shelf and wait (geom disk list still shows most disks.)
>>> * connect the shelf (missing disks)
>>>
>>> Tested Hardware:
>>> * Supermicro SAS3 847E2C-R1K28JBOD     + SAS3 LSI 9305-16e (internal daisy chain + wide links)
>>> * Supermicro SAS3 847E2C-R1K28JBOD     + SAS3 LSI 9305-16e (straight HBA <-> EXPANDER connection. (no wide links, no daisy chain))
>>> * Supermicro SAS2 SC847E26-RJBOD1      + SAS3 LSI 9305-16e (internal daisy chain)
>>> * Promise    SAS2 VTrak 830            + SAS3 LSI 9305-16e (straight HBA <-> EXPANDER connection.)
>>>
>>>
>>>
>>> On 07/04/2018 12:15 PM, Oliver Sech wrote:
>>>>> 1) Are the expanders daisy chained?  Some SAS expanders don't work reliably
>>>>> when daisy chained.   Best to direct connect each one to the server.
>>>> At the moment I have 1 JBOD connected to 1 HBA Port with 1 cable (4 lanes?).
>>>> Unfortunately the JBOD has 24 slots in the front and 20 in the back, and
>>>> those are connected via internal SAS daisy chaining.
>>>> I could rewire and connect each backplane directly to the server, but
>>>> unfortunately I do not have enough ports.
>>>>
>>>> JBOD Model: Supermicro 847E2C-R1K28JBOD
>>>>
>>>>> 2) Are the expanders connected in multipath or single path?  You need
>>>>> geom_multipath if you're going to do that.
>>>> See answer 1. There is a single path from the host to the first expander.
>>>>
>>>>> 3) Are you attempting to use wide ports (two SAS cables connecting each
>>>>> expander to the HBA)?  If so, you'll need to make sure that each pair of
>>>>> SAS cables goes to the same HBA chip (not merely the same card, as some
>>>>> cards contain two HBA chips).
>>>> See 1. The last time I opened one of those JBODs there were 8 SAS cables
>>>> between the front and back expander. I assume that wide ports are being used.
>>>> (2 expanders per backplane as well)
>>>>
>>>>> 4) Are you trying to remove an expander while ZFS is active on that
>>>>> expander?  That will suspend your pool, and ZFS doesn't always recover from
>>>>> a suspended state.
>>>> I'm testing with a new unused disk shelf that was never part of the ZFS
>>>> pool. There were

RE: problems with SAS JBODs 2

slm
I think this is a mapping table problem or the use_phy_num problem. I'm
having Oliver change the use_phy_num sysctl values to 0 and then use your
script to clear out the controller mapping entries to see what happens.

Steve

> -----Original Message-----
> From: Ken Merry [mailto:[hidden email]]
> Sent: Wednesday, July 11, 2018 2:35 PM
> To: Stephen Mcconnell; Oliver Sech
> Cc: FreeBSD-scsi
> Subject: Re: problems with SAS JBODs 2
>
> Yes, I agree, Oliver’s problem looks different.

Re: problems with SAS JBODs 2

Oliver Sech
In reply to this post by Ken Merry
On 07/11/2018 10:35 PM, Ken Merry wrote:
> Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable?  Does that cause the devices to go away?

I ran 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the "da" device disappears.
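
To repeat that read against several stale devices in one go, something like the following could be used. It is only a sketch in Python, assuming root access to the device nodes; the helper name probe is made up, and the demo runs on an ordinary temporary file and a missing path rather than on real /dev/daX nodes:

```python
import tempfile


def probe(paths, nbytes=1024):
    """Try to read nbytes from each path; return {path: "ok" or error text}."""
    results = {}
    for path in paths:
        try:
            # Unbuffered binary read, roughly what `dd bs=1k count=1` does.
            with open(path, "rb", buffering=0) as dev:
                dev.read(nbytes)
            results[path] = "ok"
        except OSError as exc:
            results[path] = f"error: {exc.strerror}"
    return results


# Demo on an ordinary file and a missing path; on the test box the list
# would hold the stale /dev/daX nodes instead (placeholder names).
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"\0" * 2048)
    tmp.flush()
    print(probe([tmp.name, tmp.name + ".missing"]))
```

A device that has really gone should come back with an error entry, and, as with dd, the failed read is what gives the driver a chance to notice the departure.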

> Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.
>
> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’).  That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
>
> It seems like the adapter may not be recognizing that the devices in question have gone.


I'm pretty sure that I tried 'camcontrol rescan all' a few times. While I'm no longer sure whether that cleans up the non-working devices, I'm sure that no new devices were added.

Unfortunately I haven't gotten to Steve's 'clear controller mapping' script yet, but I did a few other things:
* The last time I tried to upgrade the firmware I had all sorts of problems. "sas3flash" reported bad checksums while flashing some of the files.
So I reflashed both controllers with the DOS version of sas3flash. This was basically a challenge in itself because the DOS version of this utility does not seem to run on computers of this decade. (ERROR:  Failed to initialize PAL.  Exiting program.)
The equivalent sas3flash.EFI version seems to be out of date and caused the checksum problems described before.
(This time I wiped them before flashing with "sas3flash -o -e 6".)

* I tried to change the mpr tunable "use_phy_num" after that, but this has not improved the situation. I will retry and collect logs with Steve's script.
* I retried with the latest "mpr.ko" from the Broadcom download page. (Same problems, no "use_phy_num" tunable.)

* I retested this hardware with Linux (4.15 and 4.17)
** Some shelves could be replugged reliably (i.e. 45 disks show up, 45 disks disappear, 45 disks reappear)
** On the newest shelf, 2 disks were missing after replugging (i.e. 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log: mpt3sas_cm0: "device is not present handle")

* I tried a different controller
** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216) (Firmware 16.00.01.00 or 15.00.00.00)
** Yesterday I switched to a new, fresh out-of-the-box Broadcom LSI 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar with 09*))
With the new controller everything seems to work on Linux. It might be the old firmware?
On FreeBSD it is better with the new controller in the sense that I at least get one out of two /dev/sesX devices back. But disks are still missing and are not getting completely cleaned up...


This whole thing is a bit frustrating, especially since up until now I thought that HBAs were kind of "connect and forget" devices. The next step is to set up a separate test environment and try to get it to work there. I will keep you updated and try to provide logs for all FreeBSD-related problems.

Re: problems with SAS JBODs 2

Ben RUBSON
On 12 Jul 2018 12:00, Oliver Sech wrote:

> It is better with the new controller on FreeBSD in that sense that I at  
> least get one out of two /dev/sesX devices back. But disks are still  
> missing and are not getting completely cleaned up...

On 03 Jul 2018 14:54, Ben RUBSON wrote:

> I faced same sort of issue but with iSCSI disks.
> At least disks did not disconnect properly, and did not reconnect until a  
> reboot was performed.
> Among the needed iSCSI patches, a GEOM one has been pushed :
> https://github.com/freebsd/freebsd/commit/ea40366602be7548eba0bec35fec46ea4509dbb7
> (it's in 11.2, but not in your 11.1).
> Perhaps this could help.

Did you try with FreeBSD 11.2, or with the kernel patch I proposed above?

Ben


Re: problems with SAS JBODs 2

Ken Merry
In reply to this post by Oliver Sech

> On Jul 12, 2018, at 6:00 AM, Oliver Sech <[hidden email]> wrote:
>
> On 07/11/2018 10:35 PM, Ken Merry wrote:
>> Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable?  Does that cause the devices to go away?
>
> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the "da" device disappears.

Ok, that’s good.  Can you send the dmesg output and check with ‘camcontrol devlist -v’ to make sure the device has fully gone away?

The reason I ask is that I have spent lots of time over the years debugging device arrival and departure problems in CAM, GEOM and devfs, and I want to make sure we aren’t running into any non-SAS related problems.

>
>> Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.
>>
>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’).  That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
>>
>> It seems like the adapter may not be recognizing that the devices in question have gone.
>
>
> I'm pretty sure that I tried this 'camcontrol rescan all' a few times. While I not sure anymore if that cleans up the non-working devices, I'm sure that no new devices were added.

If doing a read from the device with dd makes it go away, ‘camcontrol rescan all’ should make it go away as well.  It sends a command to every device, and if the mpr(4) driver tells CAM the drive is no longer there, it’ll get removed.

If it doesn’t cause the device to get removed (and the rescan doesn’t hang), it means that you’re getting a response from a device that is no longer physically connected to the machine, which is impossible with SAS.
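
One way to check this is to save the 'camcontrol devlist' output before and after the rescan and diff the two. The following is a rough Python sketch, not a tested tool: the line layout in the regular expression is an assumption about what devlist prints, the helper names are made up, and the sample listings are invented placeholders rather than output from Oliver's system:

```python
import re

# Assumed layout of a `camcontrol devlist` device line, e.g.
#   <VENDOR MODEL REV>   at scbus0 target 467 lun 0 (pass0,da0)
DEV_RE = re.compile(
    r"<(?P<inquiry>[^>]*)>\s+at scbus(?P<bus>\d+) "
    r"target (?P<target>\d+) lun (?P<lun>[0-9a-fA-F]+) \((?P<periphs>[^)]*)\)"
)


def parse_devlist(text):
    """Map (bus, target, lun) -> list of peripheral names such as da0, pass0."""
    devices = {}
    for m in DEV_RE.finditer(text):
        key = (int(m["bus"]), int(m["target"]), m["lun"])
        devices[key] = m["periphs"].split(",")
    return devices


def vanished(before, after):
    """Devices present in the `before` listing but absent from `after`."""
    return sorted(set(parse_devlist(before)) - set(parse_devlist(after)))


# Invented sample listings, for illustration only.
before = (
    "<EXAMPLE DISK 0001>  at scbus0 target 467 lun 0 (pass0,da0)\n"
    "<EXAMPLE DISK 0001>  at scbus0 target 475 lun 0 (pass8,da8)\n"
)
after = "<EXAMPLE DISK 0001>  at scbus0 target 467 lun 0 (pass0,da0)\n"
print(vanished(before, after))
```

Entries reported by vanished() are the ones the rescan cleaned up; any target that was unplugged but still shows up in the second listing is the suspicious case described above.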

>
> Unfortunately I haven't gotten yet to Steves 'clear controller mapping' script but I did a few other things:

Steve’s email made it sound like he was going to send it.  I just sent it to you separately.

> * The last time I tried to upgrade the firmware I had all sorts of problems. "sas3flash" reported bad checksums while flashing some of the files.
> So I reflashed both controllers with the DOS version of sas3flash. This was basically a challenge in itself because the DOS version of this utility does not seem to run on computers of this decade. (ERROR:  Failed to initialize PAL.  Exiting program.)
> The equivalent sas3flash.EFI version seems to be out of date and caused the checksum problems described before.
> (This time I wiped them before flashing with "sas3flash -o -e 6”.)

That is unfortunate…perhaps Steve has some insight.

>
> * I tried to change mpr tuneable "use_phy_num" after that but this has not improved the situation. I will retry and collect logs with Steves script.

Changed it to what?  I think it defaults to 1.  Did you try 0?

> * I retried with the latest "mpr.ko" from the broadcom download page. (Same problems, no "use_phy_num" tuneable.)
>
> * I retested this hardware with Linux (4.15 and 4.17)
> ** Some shelves could be replugged reliably (ie: 45 disks show up, 45 disks disappear, 45 disks reappear)
> ** The newest shelf 2 disks were missing after the replugging (ie: 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log mpt3sas_cm0: "device is not present handle)
>
> * I tired a different controller
> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216) (Firmware 16.00.01.00 or 15.00.00.00)
> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar with 09*))
> With the new controller everything seems work on Linux. It might be the old Firmware?...
> It is better with the new controller on FreeBSD in that sense that I at least get one out of two /dev/sesX devices back. But disks are still missing and are not getting completely cleaned up…

It does sound a bit like a mapping table problem.  Clearing it might help, we’ll see.

> This whole thing is a bit frustrating, especially since up until now I thought that HBAs are kind of "connect and forget" devices. Next step is to set up a separate test environment and try to get it to work there. I will keep you updated and try provide log for all FreeBSD related problems.

Thanks for debugging this.  Unfortunately there are a number of ways it can go wrong.  The mapping code has been the source of some problems, sometimes enclosure vendors do the wrong thing, and sometimes there are other bugs.

Ken  


Re: problems with SAS JBODs 2

Oliver Sech
Sorry for the delay. I moved to a different office and could not focus on this issue last week.

I tested all of the hardware with different drivers and firmware on Linux to make sure this is not a hardware problem:
* Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
* Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
* Firmware 16.00.01.00  + Driver 26.000.00.00 -> BAD (42 out of 44 disks after reconnect)
* Firmware 16.00.01.00  + Driver 12.100.00.00 -> BAD (42 out of 44 disks after reconnect)
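
Grouping these four results by each variable makes the pattern explicit. A small Python sketch, with the rows copied from the list above and a made-up helper name:

```python
from collections import defaultdict

# (firmware, driver, result) rows copied from the list above.
RESULTS = [
    ("09.00.101.00", "26.000.00.00", "GOOD"),
    ("09.00.101.00", "12.100.00.00", "GOOD"),
    ("16.00.01.00", "26.000.00.00", "BAD"),
    ("16.00.01.00", "12.100.00.00", "BAD"),
]


def outcomes_by(rows, column):
    """Group the set of results by one column (0 = firmware, 1 = driver)."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[column]].add(row[2])
    return dict(grouped)


print(outcomes_by(RESULTS, 0))
print(outcomes_by(RESULTS, 1))
```

Grouped by firmware, each version maps to a single outcome; grouped by driver, both drivers show GOOD and BAD. In these four Linux runs the result therefore follows the firmware version and not the driver version.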

I tested a different HBA with an old firmware as well and there were no issues. Only with the latest firmware are disks missing after a reconnect, with the error mpt3sas_cm0: "device is not present handle".
I don't know yet how firmware versions between 09.00.000.00 and 16... behave.

Additional Info/Changes:
* Upgraded the test system to 11.2 as suggested on the mailing list. -> No change
* "camcontrol rescan all" removes the devices that are still present after the cable has been removed. "camcontrol devlist -v" does not show them anymore.


Setting the driver's "use_phy_num" to 0 and using the clearDPM script between connects does not help. In fact, I do not see any difference in behavior at all.
I reflashed the controller multiple times and erased everything except the "manufacturing" area to make sure that no previous settings are kept.
The only thing I know that "fixes" the missing drives is to reboot the server.

A (similar?) problem also occurs once I start the server with all 6 disk shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up properly with 5 shelves; once I connect the 6th shelf offline, some random disks are missing and I can no longer import the ZFS pool.

The following logs were collected with the very old FW 09.00.101.00 that worked on Linux.
Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0

best regards,
Oliver


Re: problems with SAS JBODs 2

Oliver Sech
Update 2: I continued to test with more and different hardware.

Tested with an LSI SAS9207-8e HBA:
* after disconnect, all devices (/dev/daX, /dev/ses) properly disappear;
no rescans or writes necessary
* no more targets in mpsutil (not mprutil)
* after reconnect, all disks and all ses devices appear!

Tested with a hardware RAID controller, LSI SAS 9286CV-8e:
* no problems with the shelf/SAS in different configurations
* switching the controller and importing the configuration works reliably

So far I think there is a problem with the mpr driver, and I'm quite confident that it affects other people as well.
With a simple configuration it is probably not immediately noticeable, as everything seems to work after the first connect/boot.
It probably gets scarier for people with multipathing and big SAS chains, I guess...

I will downgrade to SAS2 HBAs shortly as I'm running out of space. If there is anything I can help with while I still have hardware in the lab let me know.

Oliver


RE: problems with SAS JBODs 2

freebsd-scsi mailing list
Oliver, can you try changing the mapping mode on the controller? I think
you're using Enclosure/Slot Mapping and I want to see what happens with
Device Persistent Mapping. To do that, follow these steps:
1. Run Ken’s script to clear the DPM entries.
2. Use LSIUtil to change the mapping mode in IOC Page 8: Command 9, Page
Type 1, Page Number 8. If you see 00000002 at offset 0x0C, you're using
Enclosure/Slot Mapping, and I'd like you to change this. You will be asked
if you want to make changes. Select ‘yes’ and then change offset 0x0C to
00000001 (you might have to type C instead of 0x0C for the offset). Just use
the default setting to change NVRAM.
3. Reboot, see what happens, and let me know how it goes.


Steve


Re: problems with SAS JBODs 2

Oliver Sech
I ran the clear_dpm.sh script and changed the value you suggested, then rebooted and retested. As far as I can tell, there is no difference.

I tried the menu option (99.  Reset port) in lsiutil, and this helps with the missing devices. After resetting the port I get all my disks and ses devices back.

Read NVRAM or current values?  [0=NVRAM, 1=Current, default is 0]

0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
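
For anyone comparing dumps like the one above, here is a minimal sketch that reads the word at offset 0x0C and names the mapping mode. The value-to-mode table comes only from Steve's description in this thread (00000002 for Enclosure/Slot Mapping, 00000001 for Device Persistent Mapping) and has not been checked against the Broadcom MPI specification. The function and variable names are made up for illustration and are not part of lsiutil.

```python
# Sketch: decode the mapping mode from an lsiutil-style IOC Page 8 dump.
# ASSUMPTION: the value-to-mode table is taken only from the description
# in this thread, not from the MPI specification.
MAPPING_MODES = {
    0x00000001: "Device Persistent Mapping",
    0x00000002: "Enclosure/Slot Mapping",
}


def parse_page_dump(text):
    """Return {offset: value} for lines shaped like '000c : 00000001'."""
    words = {}
    for line in text.splitlines():
        offset, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a dump line
        try:
            words[int(offset.strip(), 16)] = int(value.strip(), 16)
        except ValueError:
            continue  # skip prompts and other non-hex lines
    return words


def mapping_mode(words):
    """Name the mode stored at offset 0x0C, or say why it is unknown."""
    value = words.get(0x0C)
    if value is None:
        return "offset 0x0C not present in dump"
    return MAPPING_MODES.get(value, "unknown value 0x%08x" % value)


# The dump exactly as posted above.
DUMP = """\
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
"""

print(mapping_mode(parse_page_dump(DUMP)))
```

Run against the posted dump, this reports Device Persistent Mapping, which matches the change having been written to NVRAM.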


RE: problems with SAS JBODs 2

freebsd-scsi mailing list
Can you enable Mapping Debugging, then do these steps again and send the
logs? If I don't see anything interesting in the logs, I might have you turn
more debug bits on. So, first set the debug_level to 0x203. What I'm looking
for is some indication that the driver is dropping a device or not adding
it. If that's not happening at the driver level, something else is causing
the problem. You can try setting the Event Debug flag as well
(debug_level = 0x207), but that might be too overwhelming to capture.
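
To make the two values concrete, here is a small sketch that splits a debug_level into the bits it turns on. The bit assignments are an assumption based on my reading of mpr(4) (0x0001 info, 0x0002 fault, 0x0004 event, 0x0200 mapping). Only the mapping and event bits are implied by this message, so check the man page on your release. As far as I know the value is applied through the dev.mpr.N.debug_level sysctl or the hw.mpr.debug_level loader tunable. The helper name is hypothetical.

```python
# Sketch: list which mpr(4) debug bits a debug_level value enables.
# ASSUMPTION: bit values follow my reading of mpr(4); only the bits
# relevant to 0x203 and 0x207 are listed here.
DEBUG_BITS = {
    0x0001: "info",
    0x0002: "fault",
    0x0004: "event",
    0x0200: "mapping",
}


def decode_debug_level(level):
    """Return the names of the known bits set in level, lowest bit first."""
    names = [name for bit, name in sorted(DEBUG_BITS.items()) if level & bit]
    known = 0
    for bit in DEBUG_BITS:
        known |= bit
    leftover = level & ~known
    if leftover:
        # Bits outside the table above are reported, not interpreted.
        names.append("other(0x%x)" % leftover)
    return names


for level in (0x203, 0x207):
    print("0x%03x -> %s" % (level, ", ".join(decode_debug_level(level))))
```

Under these assumptions, 0x203 is info + fault + mapping, and 0x207 adds the event bit on top of that.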

Steve

> -----Original Message-----
> From: Oliver Sech [mailto:[hidden email]]
> Sent: Wednesday, July 25, 2018 4:24 AM
> To: Stephen Mcconnell; FreeBSD-scsi
> Subject: Re: problems with SAS JBODs 2
>
> I ran the clear_dpm.sh script and changed the value you suggested.
> Rebooted and retested. As far as I can tell there is no difference.
>
> I tried the menu option (99.  Reset port) in lsiutil and this helps with
> missing
> devices. After reseting the port I get all my disks and ses devs again.
>
> Read NVRAM or current values?  [0=NVRAM, 1=Current, default is 0]
>
> 0000 : 21080600
> 0004 : 00000001
> 0008 : 00180080
> 000c : 00000001
> 0010 : 00000000
> 0014 : 00000000
>
> On 07/24/2018 10:22 PM, Stephen Mcconnell wrote:
> > Oliver, can you try changing the mapping mode on the controller? I think
> > you're using Enclosure/Slot Mapping and I want to see what happens with
> > Device Persistent Mapping. To do that, follow these steps:
> > 1. Run Ken’s script to clear the DPM entries
> > 2. Use LSIUtil to change the mapping mode in IOC Page 8. Command 9,
> Page
> > Type 1, Page Number 8. If you see 0000002 at offset 0x0C you're using
> > Enclosure/Slot Mapping and I'd like you to change this. You will be
> > asked if
> > you want to make changes. Select ‘yes’ and then change offset 0x0C to
> > 00000001 (you might have to type C instead of 0x0C for the offset). Just
> use
> > the default setting to change NVRAM.
> > 3. Reboot and see what happens and let me know how it goes.
> >
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: [hidden email] [mailto:owner-freebsd-
> >> [hidden email]] On Behalf Of Oliver Sech
> >> Sent: Tuesday, July 24, 2018 12:23 PM
> >> To: FreeBSD-scsi
> >> Subject: Re: problems with SAS JBODs 2
> >>
> >> update 2: I continued to test with more and different hardware.
> >>
> >> tested with a LSI SAS9207-8e HBA:
> >> * after disconnect all devices properly disappear /dev/daX /dev/ses
> >> no rescans or writing necessary
> >> * no more targets in mpsutil (not mprutil)
> >> * after reconnect all disks and all ses devs appear!
> >>
> >> tested with hardware raid LSI SAS 9286CV-8e
> >> * no problems with the shelf/sas in different configurations
> >> * switching the controller and importing configuration works reliably
> >>
> >> So far I think there is a problem with the mpr driver and I'm quite
> >> confident
> >> that it does affect other people.
> >> With a simple configuration is probably not immediately noticeable as
> >> everything seems to work after the first connect/boot.
> >> It probably gets scarier for people with multipathing and big SAS
> >> chains I
> >> guess...
> >>
> >> I will downgrade to SAS2 HBAs shortly as I'm running out of space. If
> >> there is
> >> anything I can help with while I still have hardware in the lab let me
> >> know.
> >>
> >> Oliver
> >>
> >> On 07/23/2018 04:14 PM, Oliver Sech wrote:
> >>> Sorry for the delay. I moved to a different office and could not focus
> >>> on
> >> this issue last week.
> >>>
> >>> I tested all of the hardware with different drivers and firmware on
> >>> Linux to
> >> make sure this is not a hardware problem:
> >>> * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
> >>> * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
> >>> * Firmware 16.00.01.00  + Driver 26.000.00.00 -> BAD (42 out of 44
> >>> disks
> >> after reconnect)
> >>> * Firmware 16.00.01.00  + Driver 12.100.00.00 -> BAD (42 out of 44
> >>> disks
> >> after reconnect)
> >>>
> >>> I tested a different HBA with an old firmware as well and there were
> >>> no
> >> issues. Only with the latest FW disks are missing after a reconnect
> >> with
> >> the
> >> error "mpt3sas_cm0: "device is not present handle"
> >>> I don't know yet how different Firmware behaves between version
> >> 09.00.000.00 and 16...
> >>>
> >>> Additional Info/Changes:
> >>> * Upgraded testsystem to 11.2 as suggested in the mailing list. -> No
> >> Change
> >>> * "camcontrol rescan all" removes the devices that are still present
> >>> after
> >> the cable has been removed. "camcontrol devlist -v" does not show them
> >> anymore
> >>>
> >>>
> >>> Setting the driver "use_phy_num" to 0 and using the clearDPM script
> >> between connects does not help. In fact I do not see a different
> >> behavior
> >> at
> >> all?
> >>> I reflashed the controller multiple times and erased everything except
> >>> the
> >> "manufacturing" area to make sure that no previous settings are kept.
> >>> The only thing I know that "fixes" the missing drives is to reboot the
> >>> server.
> >>>
> >>> A (similar?) problem also occurs once I start the server with all 6
> >>> disk
> >> shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up
> >> properly with 5 shelves, once I offline connect the 6th shelve, then
> >> some
> >> random disks are missing and I cannot longer import the ZFS pool.
> >>>
> >>> The following logs were collected with the very old FW 09.00.101.00
> >>> that
> >> worked on Linux.
> >>> Logs:
> https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
> >>>
> >>> best regards,
> >>> Oliver
> >>>
> >>> On 07/12/2018 03:38 PM, Ken Merry wrote:
> >>>>
> >>>>> On Jul 12, 2018, at 6:00 AM, Oliver Sech <[hidden email]>
> >> wrote:
> >>>>>
> >>>>> On 07/11/2018 10:35 PM, Ken Merry wrote:
> >>>>>> Oliver, what happens when you try to do I/O to the devices that
> don’t
> >> go away after you pull the cable?  Does that cause the devices to go
> away?
> >>>>>
> >>>>> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least
> >>>>> the
> >> "da" device disappears.
> >>>>
> >>>> Ok, that’s good.  Can you send the dmesg output and check with
> >> ‘camcontrol devlist -v’ to make sure the device has fully gone away?
> >>>>
> >>>> The reason I ask is that I have spent lots of time over the years
> >>>> debugging
> >> device arrival and departure problems in CAM, GEOM and devfs, and I
> want
> >> to make sure we aren’t running into any non-SAS related problems.
> >>>>
> >>>>>
> >>>>>> Looking at the mprutil output, it also shows the devices sticking
> >>>>>> around
> >> from the adapter’s standpoint.
> >>>>>>
> >>>>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan
> >>>>>> N’
> >> (where N is the scbus number shown by ‘camcontrol devlist -v’).  That
> >> will
> >> do
> >> some basic probes for each of the devices and should in theory cause
> them
> >> to go away if they aren’t accessible.
> >>>>>>
> >>>>>> It seems like the adapter may not be recognizing that the devices
> >>>>>> in
> >> question have gone.
> >>>>>
> >>>>>
> >>>>> I'm pretty sure that I tried this 'camcontrol rescan all' a few
> >>>>> times.
> >>>>> While
> >> I am not sure anymore whether that cleans up the non-working devices, I'm sure
> >> that
> >> no new devices were added.
> >>>>
> >>>> If doing a read from the device with dd makes it go away, ‘camcontrol
> >> rescan all’ should make it go away as well.  It sends a command to every
> >> device, and if the mpr(4) driver tells CAM the drive is no longer
> >> there,
> >> it’ll get
> >> removed.
> >>>>
> >>>> If it doesn’t cause the device to get removed (and the rescan doesn’t
> >> hang), it means that you’re getting a response from a device that is no
> >> longer physically connected to the machine, which is impossible with
> >> SAS.
> >>>>
> >>>>>
> >>>>> Unfortunately I haven't yet gotten to Steve's 'clear controller
> >>>>> mapping'
> >> script but I did a few other things:
> >>>>
> >>>> Steve’s email made it sound like he was going to send it.  I just
> >>>> sent
> >>>> it to
> >> you separately.
> >>>>
> >>>>> * The last time I tried to upgrade the firmware I had all sorts of
> >> problems. "sas3flash" reported bad checksums while flashing some of the
> >> files.
> >>>>> So I reflashed both controllers with the DOS version of sas3flash.
> >>>>> This
> >> was basically a challenge in itself because the DOS version of this
> >> utility does
> >> not seem to run on computers of this decade. (ERROR:  Failed to
> >> initialize
> >> PAL.  Exiting program.)
> >>>>> The equivalent sas3flash.EFI version seems to be out of date and
> >>>>> caused
> >> the checksum problems described before.
> >>>>> (This time I wiped them before flashing with "sas3flash -o -e 6”.)
> >>>>
> >>>> That is unfortunate…perhaps Steve has some insight.
> >>>>
> >>>>>
> >>>>> * I tried to change mpr tuneable "use_phy_num" after that but this
> has
> >> not improved the situation. I will retry and collect logs with Steve's
> >> script.
> >>>>
> >>>> Changed it to what?  I think it defaults to 1.  Did you try 0?
> >>>>
> >>>>> * I retried with the latest "mpr.ko" from the broadcom download
> page.
> >> (Same problems, no "use_phy_num" tuneable.)
> >>>>>
> >>>>> * I retested this hardware with Linux (4.15 and 4.17)
> >>>>> ** Some shelves could be replugged reliably (ie: 45 disks show up,
> >>>>> 45
> >> disks disappear, 45 disks reappear)
> >>>>> ** On the newest shelf, 2 disks were missing after replugging (ie:
> >>>>> 44
> >> disks show up, 44 disks disappear, 42 disks reappear) (kernel log
> >> mpt3sas_cm0: "device is not present handle)
> >>>>>
> >>>>> * I tried a different controller
> >>>>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
> >> (Firmware 16.00.01.00 or 15.00.00.00)
> >>>>> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI
> >> 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something
> similar
> >> with 09*))
> >>>>> With the new controller everything seems to work on Linux. It might be
> >>>>> the
> >> old Firmware?...
> >>>>> It is better with the new controller on FreeBSD in the sense that I
> >>>>> at
> >> least get one out of two /dev/sesX devices back. But disks are still
> >> missing
> >> and are not getting completely cleaned up…
> >>>>
> >>>> It does sound a bit like a mapping table problem.  Clearing it might
> >>>> help,
> >> we’ll see.
> >>>>
> >>>>> This whole thing is a bit frustrating, especially since up until now
> >>>>> I
> >> thought that HBAs are kind of "connect and forget" devices. Next step
> >> is
> >> to
> >> set up a separate test environment and try to get it to work there. I
> >> will
> >> keep
> >> you updated and try to provide logs for all FreeBSD related problems.
> >>>>
> >>>> Thanks for debugging this.  Unfortunately there are a number of ways
> >>>> it
> >> can go wrong.  The mapping code has been the source of some problems,
> >> sometimes enclosure vendors do the wrong thing, and sometimes there
> are
> >> other bugs.
> >>>>
> >>>> Ken
> >>>>
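For anyone repeating the check discussed above (a small read from each disk, then a rescan), here is a minimal sketch of how it could be scripted. It assumes a FreeBSD host where "sysctl -n kern.disks" lists the disk names; the function names and the PROBE_RUN switch are made up for this example and are not part of any tool mentioned in the thread.

```shell
#!/bin/sh
# Sketch only: read 1 KiB from every da(4) disk so that stale devices
# get noticed, then rescan and list what CAM still sees.

# Print only the da* names from a whitespace-separated disk list.
# (Helper name is hypothetical.)
da_disks() {
    for d in $1; do
        case "$d" in
            da[0-9]*) echo "$d" ;;
        esac
    done
}

# Probe each da disk, then rescan all buses and show the device list.
probe_all() {
    for d in $(da_disks "$(sysctl -n kern.disks)"); do
        dd if="/dev/$d" of=/dev/null bs=1k count=1 2>/dev/null \
            || echo "read failed: $d"
    done
    camcontrol rescan all
    camcontrol devlist -v
}

# Only touch hardware when explicitly asked to (PROBE_RUN is an
# assumption of this sketch, not an existing convention).
if [ "${PROBE_RUN:-0}" = "1" ]; then
    probe_all
fi
```

Run it as root with "PROBE_RUN=1 sh probe.sh" after pulling the cable, and compare the "camcontrol devlist -v" output with the one taken before.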