ZFS: how to replace a dead disk?

James R. Van Artsdalen
What's the right way to replace a dead disk under ZFS?

When ada1 failed I tried "zpool replace jwrc ada1" and when it finished
I got this:

[root@cyclone ~]# zpool status
  pool: jwrc
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 14h55m with 11 errors on Fri May 28 04:28:56 2010
config:

    NAME            STATE     READ WRITE CKSUM
    jwrc            DEGRADED     0     0    11
      raidz1        DEGRADED     0     0    23
        ada2        ONLINE       0     0     0  889M resilvered
        replacing   DEGRADED     0     0     0
          ada1/old  UNAVAIL      0  256K     0  cannot open
          ada1      ONLINE       0     0     0  1.47T resilvered
        ada3        ONLINE       0     0     0  879M resilvered
        ada4        ONLINE       0     0     0  808M resilvered

errors: 5 data errors, use '-v' for a list

---

It says "replacing" and that the device, vdev and pool are degraded, yet
the "resilver" finished hours ago.  I cannot detach the ada1/old entry.

Is there some other command I should have used to remove the dead ada1
device?

This kernel 206111, roughly April 1, on amd64
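
The state described above (a "replacing" vdev still listed after the resilver has reported completion) can be checked for mechanically. The following is a minimal sketch, not an official tool: it assumes the indentation-based layout of the "zpool status" output shown in this message, and the function name stuck_replacing is made up for illustration.

```sh
#!/bin/sh
# Sketch: read `zpool status` text on stdin and print the children of any
# "replacing" vdev, but only if the scrub line says a resilver completed.
# Relies on the indentation of the status layout shown in this thread.
stuck_replacing() {
    awk '
        /resilver completed/ { finished = 1 }
        $1 == "replacing" {
            # Remember how far the "replacing" row is indented.
            inrep = 1
            match($0, /[^ \t]/)
            depth = RSTART
            next
        }
        inrep {
            # Rows indented deeper than "replacing" are its children.
            match($0, /[^ \t]/)
            if (RSTART > depth) {
                kids = kids $1 " " $2 "\n"
            } else {
                inrep = 0
            }
        }
        END {
            if (finished && kids != "") {
                printf "%s", kids
                exit 0
            }
            exit 1
        }
    '
}
```

Usage would be something like "zpool status jwrc | stuck_replacing". The function exits non-zero when no finished-but-still-replacing vdev is found.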
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"

Re: ZFS: how to replace a dead disk?

Jeremy Chadwick
On Fri, May 28, 2010 at 08:36:38AM -0500, James R. Van Artsdalen wrote:

> What's the right way to replace a dead disk under ZFS?
>
> When ada1 failed I tried "zpool replace jwrc ada1" and when it finished
> I got this:
>
> root@cyclone ~]# zpool status
>   pool: jwrc
>  state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>     corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>     entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: resilver completed after 14h55m with 11 errors on Fri May 28
> 04:28:56 2010
> config:
>
>     NAME            STATE     READ WRITE CKSUM
>     jwrc            DEGRADED     0     0    11
>       raidz1        DEGRADED     0     0    23
>         ada2        ONLINE       0     0     0  889M resilvered
>         replacing   DEGRADED     0     0     0
>           ada1/old  UNAVAIL      0  256K     0  cannot open
>           ada1      ONLINE       0     0     0  1.47T resilvered
>         ada3        ONLINE       0     0     0  879M resilvered
>         ada4        ONLINE       0     0     0  808M resilvered
>
> errors: 5 data errors, use '-v' for a list
>
> ---
>
> It says "replacing" and that the device, vdev and pool are degraded, yet
> the "resilver" finished hours ago.  I cannot detach the ada1/old entry.
>
> Is there some other command I should have used to remove the dead ada1
> device?

What version of FreeBSD?  Please provide uname -a output and not "8.0"
or something equally as terse.

Some clarification: you didn't remove the device, you simply told ZFS to
assume that the device had been replaced.

What did you do (both physically and software/command-line-wise) *prior*
to issuing "zpool replace jwrc ada1"?

--
| Jeremy Chadwick                                   [hidden email] |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Re: ZFS: how to replace a dead disk?

James R. Van Artsdalen
Jeremy Chadwick wrote:

> On Fri, May 28, 2010 at 08:36:38AM -0500, James R. Van Artsdalen wrote:
>> What's the right way to replace a dead disk under ZFS?
>>
>>         replacing   DEGRADED     0     0     0
>>           ada1/old  UNAVAIL      0  256K     0  cannot open
>>           ada1      ONLINE       0     0     0  1.47T resilvered
>> ---
>>
>> It says "replacing" and that the device, vdev and pool are degraded, yet
>> the "resilver" finished hours ago.  I cannot detach the ada1/old entry.
>>
>> Is there some other command I should have used to remove the dead ada1
>> device?
>
> What version of FreeBSD?  Please provide uname -a output and not "8.0"
> or something equally as terse.
>
> Some clarification: you didn't remove the device, you simply told ZFS to
> assume that the device had been replaced.
>
> What did you do (both physically and software/command-line-wise) *prior*
> to issuing "zpool replace jwrc ada1"?
>
Sorry: my original note contained version information but that isn't in
your reply?

FreeBSD cyclone 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r206111: Fri Apr  2
13:47:20 CDT 2010    
[hidden email]:/usr/obj/usr/src/sys/GENERIC  amd64

The original disk is no longer usable by FreeBSD in any way: it returned
a stream of errors & noise on its port in a way that left the system
unable to boot.  I physically replaced that disk with a new disk before
attempting the "zpool replace".

No actions were taken prior to replacing the disk.  I went to the site
to see why the server was unresponsive, saw that one drive was
problematic by watching the activity LEDs, physically replaced that
disk, booted, and logically replaced that disk with "zpool replace jwrc
ada1".

Re: ZFS: how to replace a dead disk?

Jeremy Chadwick
On Fri, May 28, 2010 at 11:34:23AM -0500, James R. Van Artsdalen wrote:

> Jeremy Chadwick wrote:
> > On Fri, May 28, 2010 at 08:36:38AM -0500, James R. Van Artsdalen wrote:
> >> What's the right way to replace a dead disk under ZFS?
> >>
> >>         replacing   DEGRADED     0     0     0
> >>           ada1/old  UNAVAIL      0  256K     0  cannot open
> >>           ada1      ONLINE       0     0     0  1.47T resilvered
> >> ---
> >>
> >> It says "replacing" and that the device, vdev and pool are degraded, yet
> >> the "resilver" finished hours ago.  I cannot detach the ada1/old entry.
> >>
> >> Is there some other command I should have used to remove the dead ada1
> >> device?
> >
> > What version of FreeBSD?  Please provide uname -a output and not "8.0"
> > or something equally as terse.
> >
> > Some clarification: you didn't remove the device, you simply told ZFS to
> > assume that the device had been replaced.
> >
> > What did you do (both physically and software/command-line-wise) *prior*
> > to issuing "zpool replace jwrc ada1"?
> >
> Sorry: my original note contained version information but that isn't in
> your reply?
>
> FreeBSD cyclone 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r206111: Fri Apr  2
> 13:47:20 CDT 2010    
> [hidden email]:/usr/obj/usr/src/sys/GENERIC  amd64

The only line sent to the list was: "This kernel 206111, roughly April
1, on amd64".  Here's verification:

http://lists.freebsd.org/pipermail/freebsd-fs/2010-May/008592.html

So now we know you're running HEAD.

> The original disk is no longer usable by FreeBSD in any way: it returned
> a stream of errors & noise on its port in a way that left the system
> unable to boot.  I physically replaced that disk with a new disk before
> attempting the "zpool replace"
>
> No actions were taken prior to replacing the disk.  I went to the site
> to see why the server was unresponsive, saw that one drive was
> problematic by watching the activity LEDs, physically replaced that
> disk, booted, and logically replaced that disk with "zpool replace jwrc
> ada1"

I think the procedure you executed might be the problem.  The steps I've
used in the past, 100% reliably, with ata(4) and AHCI on an ICH7 and
ICH9 are:

 1. zpool offline pool adX
 2. atacontrol list (to find the ataX device number)
 3. atacontrol detach ataX
 4. dmesg (verify the detach worked)
 5. Physically remove the disk (must be in a hot-swap enclosure)
 6. Physically insert the new disk
 7. atacontrol attach ataX
 8. dmesg (to determine what the adX drive number is; on my systems
    the adX drive number remains static/does not change)
 9. zpool online pool adX
10. zpool replace pool adX
11. zpool status  (watch until finished)

This follows the Solaris ZFS Administrator's guide, except that
atacontrol(8) is used instead of cfgadm(1M).  See Example 11-1:

http://docs.sun.com/app/docs/doc/819-5461/gbbzy?l=en&a=view
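
The atacontrol-based steps listed above can be written out as a small dry-run helper. This is only a sketch: it prints the commands for review and executes nothing, the function name print_replace_plan is made up, and the pool, disk and channel names (tank, ad4, ata2) are placeholders rather than values from any real system.

```sh
#!/bin/sh
# Sketch: print the hot-swap replacement sequence for review.
# Nothing is executed; run the printed commands by hand, one at a time.
# Arguments: pool name, adX disk name, ataX channel name (all placeholders).
print_replace_plan() {
    pool=$1
    disk=$2
    channel=$3
    cat <<EOF
zpool offline $pool $disk
atacontrol list
atacontrol detach $channel
dmesg
# physically swap the disk here (hot-swap enclosure required)
atacontrol attach $channel
dmesg
zpool online $pool $disk
zpool replace $pool $disk
zpool status $pool
EOF
}

# Example with placeholder names:
print_replace_plan tank ad4 ata2
```

Printing rather than executing is deliberate: the dmesg checks after detach and attach are meant to be read by a person before continuing.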

The same procedure should ideally be followed with ahci.ko + CAM, using
camcontrol devlist/eject, camcontrol rescan (may not be needed, but use
devlist to verify that the kernel noticed the removal/addition), and
camcontrol load.

If you'd like me to verify and demonstrate this on FreeBSD (RELENG_8
only, however -- I don't run CURRENT) I can do so.  I can also do the
same thing with ahci.ko + CAM.  Just let me know.

--
| Jeremy Chadwick                                   [hidden email] |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Re: ZFS: how to replace a dead disk?

Wes Morgan-2
In reply to this post by James R. Van Artsdalen
On Fri, 28 May 2010, James R. Van Artsdalen wrote:

> Jeremy Chadwick wrote:
> > On Fri, May 28, 2010 at 08:36:38AM -0500, James R. Van Artsdalen wrote:
> >> What's the right way to replace a dead disk under ZFS?
> >>
> >>         replacing   DEGRADED     0     0     0
> >>           ada1/old  UNAVAIL      0  256K     0  cannot open
> >>           ada1      ONLINE       0     0     0  1.47T resilvered
> >> ---
> >>
> >> It says "replacing" and that the device, vdev and pool are degraded, yet
> >> the "resilver" finished hours ago.  I cannot detach the ada1/old entry.
> >>
> >> Is there some other command I should have used to remove the dead ada1
> >> device?
> >
> > What version of FreeBSD?  Please provide uname -a output and not "8.0"
> > or something equally as terse.
> >
> > Some clarification: you didn't remove the device, you simply told ZFS to
> > assume that the device had been replaced.
> >
> > What did you do (both physically and software/command-line-wise) *prior*
> > to issuing "zpool replace jwrc ada1"?
> >
> Sorry: my original note contained version information but that isn't in
> your reply?
>
> FreeBSD cyclone 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r206111: Fri Apr  2
> 13:47:20 CDT 2010
> [hidden email]:/usr/obj/usr/src/sys/GENERIC  amd64
>
> The original disk is no longer usable by FreeBSD in any way: it returned
> a stream of errors & noise on its port in a way that left the system
> unable to boot.  I physically replaced that disk with a new disk before
> attempting the "zpool replace"
>
> No actions were taken prior to replacing the disk.  I went to the site
> to see why the server was unresponsive, saw that one drive was
> problematic by watching the activity LEDs, physically replaced that
> disk, booted, and logically replaced that disk with "zpool replace jwrc
> ada1"

Is it possible your array had some errors on other devices prior to the
ada1 disk failing? When was the last scrub?
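
One rough way to look for trouble on the other members is to filter the config table of "zpool status" for rows whose READ, WRITE or CKSUM counters are not zero. The sketch below assumes the column layout shown earlier in this thread. The function name nonzero_errors is made up for illustration.

```sh
#!/bin/sh
# Sketch: read `zpool status` text on stdin and print every row of the
# config table whose READ, WRITE or CKSUM counter is not "0".
nonzero_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { intable = 1; next }
        # A blank line ends the table.
        intable && NF == 0 { intable = 0 }
        intable && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}
```

Usage would be something like "zpool status jwrc | nonzero_errors". As the status output itself suggests, "zpool status -v" lists the files affected by the data errors.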