RFC: Suggesting ZFS "best practices" in FreeBSD


RFC: Suggesting ZFS "best practices" in FreeBSD

Borja Marcos-2
(Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue)



Hi :)

Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well-proven "doctrine" in order to avoid problems. Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which present personal preferences and even misconceptions as "advice" (I saw one of those howtos suggesting disabling checksums "because they are useless").

ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers which tend to make things worse.

I've been using ZFS on FreeBSD since the first versions, and I have noticed several serious problems. I'll try to explain some of them, along with my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.



1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)

ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.

For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to their device number minus one. Potential for trouble, as a minimum.

After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive, and label it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.

When creating a pool, I use these names instead of dealing with device numbers. For example:

% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
config:

        NAME                 STATE     READ WRITE CKSUM
        rpool                ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            gpt/rpool-disk1  ONLINE       0     0     0
            gpt/rpool-disk2  ONLINE       0     0     0
        logs
          gpt/zfs-log        ONLINE       0     0     0
        cache
          gpt/zfs-cache      ONLINE       0     0     0

Using a unique name for each disk within your organization is important. That way, you can safely move the disks to a different server, which might already be using ZFS, and still be able to import the pool without name collisions. Of course you could use gptids, which, as far as I know, are unique, but they are difficult to use, and in case of a disk failure it's not easy to determine which disk to replace.
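
For reference, this is roughly how such a labeled setup can be created (device names, alignment and label names here are just an illustration, adjust to taste):

gpart create -s gpt da0
gpart add -t freebsd-zfs -a 1m -l rpool-disk1 da0
gpart create -s gpt da1
gpart add -t freebsd-zfs -a 1m -l rpool-disk2 da1
zpool create rpool mirror gpt/rpool-disk1 gpt/rpool-disk2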




2- RAID cards.

Simply: avoid them like the plague. ZFS is designed to operate on bare disks, and it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards.

There are two solutions adopted by RAID card users, and neither of them is good. The first, and obvious, one is to create a RAID5 taking advantage of the battery-backed cache (if present). It works, but it loses some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something trivial like replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.

The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call it JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache. At the very least, you have to deal with a cumbersome and complicated management program to replace a disk, and you often have to reboot the server.

The biggest reason to avoid these stupid cards, anyway, is plain and simple: those cards, at least the ones I have tried bundled by Dell as PERC (insert a random number here) or by Sun, isolate the ASC/ASCQ sense codes from the filesystem. Pure crap.

Years ago, fighting this issue when ZFS was still rather experimental, I asked for help and Scott Long sent me a simple "don't try this at home" patch, so that the disks became available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but so far so good. And indeed the sense codes are lost when a RAID card creates a volume, even in the misnamed "JBOD" configuration.


http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679

Anyway, even if there might be some issues due to command handling, the end-to-end verification performed by ZFS should ensure that, as a minimum, the data on the disks won't be silently corrupted: if corruption happens, it will be detected. I much prefer to have ZFS deal with that, instead of working on a sort of "virtual" disk implemented on the RAID card.

Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID card labels the disks. Moving one disk from one machine to another will result in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (a reboot which is mandatory in order to be able to access the transferred disk). Once again, additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.

Summarizing: I would *strongly* recommend avoiding RAID cards and getting proper host adapters without any fancy functionality instead. The one sold by Dell as the H200 seems to work very well. No need to create any JBOD or fancy thing at all: it will just expose the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.

But, in case that's not possible, I am still leaning towards the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing, by patching the driver. There is one issue, though: on reboot the RAID cards freeze, and I am not sure why. Maybe that could be fixed; it happens on machines on which I am not using the RAID functionality at all. They should become "transparent", but they don't.

Also, I think that the so-called JBOD thing would get in the way of a ZFS health daemon doing things such as automatic failed-disk replacement with hot-spares, etc. And there won't be a real ASC/ASCQ log message for diagnosis.

(See the bottom of this message for a problem I have just had with a "JBOD" configuration.)




3- Installation, boot, etc.

Here I am not sure. Before zfsboot became available, I used to create a zfs-on-root system by doing, more or less, this:

- Install the base system on a pendrive. After the installation, just /boot will be used from the pendrive, and /boot/loader.conf will load the ZFS module and point vfs.root.mountfrom at the pool.

- Create the ZFS pool.

- Create and populate the root hierarchy. I used to create something like:

pool/root
pool/root/var
pool/root/usr
pool/root/tmp

Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically.

I also set the mountpoint of the "system" tree to legacy, and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount" of critical filesystems in case, for example, I boot off a pendrive in order to tinker.
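
As an illustration of that layout (a sketch, not a full boot recipe; the legacy mountpoint is inherited by the children, so only the top of the "system" tree needs it set explicitly):

zfs create -o mountpoint=legacy pool/root
zfs create pool/root/var
zfs create pool/root/usr
zfs create pool/root/tmp

and then in /etc/fstab:

pool/root      /     zfs  rw  0  0
pool/root/var  /var  zfs  rw  0  0
pool/root/usr  /usr  zfs  rw  0  0
pool/root/tmp  /tmp  zfs  rw  0  0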

For the last system I installed, I tried zfsboot instead of booting off the /boot directory of a FFS partition.




(*) An example of RAID/JBOD-induced crap, and of the problem of not using static naming, follows.

I am using a Sun server running FreeBSD. It has 16 x 160 GB SAS disks and one of those cards I "worship": this particular one is controlled by the aac driver.

As I was going to tinker a lot, I decided to create a RAID-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and to use the other 14 disks as a pool with two raidz vdevs of 6 disks, leaving two disks as hot-spares. Later I removed one of the hot-spares and installed an SSD with two partitions, to try to make it work as L2ARC and log. As I had gone for the JBOD pain, of course replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.

  pool: pool
 state: ONLINE
  scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
config:

        NAME             STATE     READ WRITE CKSUM
        pool             ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            aacd1        ONLINE       0     0     0
            aacd2        ONLINE       0     0     0
            aacd3        ONLINE       0     0     0
            aacd4        ONLINE       0     0     0
            aacd5        ONLINE       0     0     0
            aacd6        ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            aacd7        ONLINE       0     0     0
            aacd8        ONLINE       0     0     0
            aacd9        ONLINE       0     0     0
            aacd10       ONLINE       0     0     0
            aacd11       ONLINE       0     0     0
            aacd12       ONLINE       0     0     0
        logs
          gpt/zfs-log    ONLINE       0     0     0
        cache
          gpt/zfs-cache  ONLINE       0     0     0
        spares
          aacd14         AVAIL  

errors: No known data errors



The fun began when a disk failed. When it happened, I offlined it and replaced it with the remaining hot-spare. But something had changed, and the pool remained in this state:

% zpool status
  pool: pool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
config:

        NAME                        STATE     READ WRITE CKSUM
        pool                        DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            spare-0                 DEGRADED     0     0     0
              13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
              aacd14                ONLINE       0     0     0
            aacd2                   ONLINE       0     0     0
            aacd3                   ONLINE       0     0     0
            aacd4                   ONLINE       0     0     0
            aacd5                   ONLINE       0     0     0
            aacd6                   ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            aacd7                   ONLINE       0     0     0
            aacd8                   ONLINE       0     0     0
            aacd9                   ONLINE       0     0     0
            aacd10                  ONLINE       0     0     0
            aacd11                  ONLINE       0     0     0
            aacd12                  ONLINE       0     0     0
        logs
          gpt/zfs-log               ONLINE       0     0     0
        cache
          gpt/zfs-cache             ONLINE       0     0     0
        spares
          2388350688826453610       INUSE     was /dev/aacd14

errors: No known data errors
%


ZFS was somewhat confused by the JBOD volumes, and it was impossible to get out of this situation. A reboot revealed that the card, apparently, had changed volume numbers. Thanks to the resiliency of ZFS I didn't lose a single bit of data, but the situation seemed risky. Finally I was able to fix it by replacing the failed disk, rebooting the whole server (of course) and doing a zpool replace. But the card added some confusion, and I still don't know what the disk failure actually was. No trace of a meaningful error message.




Best regards,






Borja.



Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Andreas Nilsson-8
Seems good to me, but I think you could/should push for pool names that
are unique as well, to ease import on another system.

I would recommend avoiding HP RAID cards as well (at least the P400i and
P410i), as gptzfsboot cannot find a pool on the first "logical" disk
presented to the OS; see
http://lists.freebsd.org/pipermail/freebsd-current/2011-August/026175.html

Best regards
Andreas

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Andriy Gapon
In reply to this post by Borja Marcos-2
on 22/01/2013 13:03 Borja Marcos said the following:
> pool/root pool/root/var pool/root/usr pool/root/tmp
>
> Why pool/root instead of simply "pool"? Because it's easier to understand,
> snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's
> possible to snapshot the whole "system" tree atomically.

I recommend placing "/" into pool/ROOT/<current-name>.
That would be very useful for boot environments (BEs - use them!).

> I also set the mountpoint of the "system" tree as legacy, and rely on
> /etc/fstab.

I don't place anything for ZFS into fstab.
Nor do I use the vfs.root.mountfrom loader variable.
I depend on the boot and kernel code doing the right thing based on the pool's
bootfs property.
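
A rough sketch of that layout (dataset names here are only an example of the idea, not necessarily what you will end up with):

zfs create -o mountpoint=none pool/ROOT
zfs create -o mountpoint=/ -o canmount=noauto pool/ROOT/default
zpool set bootfs=pool/ROOT/default pool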

> Why? In order to avoid an accidental "auto mount"  of critical
> filesystems in case, for example, I boot off a pendrive in order to tinker.

Not sure what you mean; if you don't import the pool, nothing gets mounted.
If you remember to use import -R, then everything gets mounted in controlled places.

--
Andriy Gapon

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Scott Long-2
In reply to this post by Borja Marcos-2

On Jan 22, 2013, at 4:03 AM, Borja Marcos <[hidden email]> wrote:

> (Scott, I hope you don't mind to be CC'd, I'm not sure you read the -FS mailing list, and this is a SCSI//FS issue)
>
>
>
> Hi :)
>
> Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well proven "doctrine" in order to avoid problems. Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").
>
> ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows oriented server vendors insist on bundling servers with crappy raid controllers which tend to make things worse.
>
> Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I try to explain some of them, and my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.
>
>
>
> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>
> ZFS was born in a system with static device naming (Solaris). When you plug a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>

Look up SCSI device wiring in /sys/conf/NOTES.  That's one solution to static naming, just with a slightly different angle than Solaris.  I do agree with your general thesis here, and either wiring should be made a much more visible and documented feature, or a new mechanism should be developed to provide naming stability.  Please let me know what you think of the wiring mechanism.
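
For reference, the hints documented there look roughly like this (controller, bus and target numbers below are made up for illustration):

# wire da0 to target 0, LUN 0 of the first SCSI bus, regardless of probe order
hint.scbus.0.at="ahc0"
hint.da.0.at="scbus0"
hint.da.0.target="0"
hint.da.0.unit="0"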
>
>
> 2- RAID cards.
>
> Simply: Avoid them like the pest. ZFS is designed to operate on bare disks. And it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards.
>

Agree 200%.  Despite the best efforts of sales and marketing people, RAID cards do not make good HBAs.  At best they add latency.  At worst, they add a lot of latency and extra failure modes.

Scott


Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Warren Block
In reply to this post by Borja Marcos-2
On Tue, 22 Jan 2013, Borja Marcos wrote:

> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>
> ZFS was born in a system with static device naming (Solaris). When you plug a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>
> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to the device number -1. Potential for trouble as a minimum.
>
> After several different installations, I am preferring to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition in each drive, and Iabel it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.

I'm a proponent of using various types of labels, but my impression
after a recent experience was that ZFS metadata was enough to identify
the drives even if they were moved around.  That is, ZFS bare metadata
on a drive with no other partitioning or labels.

Is that incorrect?

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Warren Block
In reply to this post by Borja Marcos-2
On Tue, 22 Jan 2013, Borja Marcos wrote:

> Hope nobody will hate me too much, but ZFS usage under FreeBSD is
> still chaotic. We badly need a well proven "doctrine" in order to
> avoid problems.

I would like to see guidelines for at least two common scenarios:

Multi-terabyte file server with multi-drive pool.
Limited RAM (1G) root-on-ZFS workstation with a single disk.

The first is easy with the defaults, but particular tuning could be
beneficial.  It would also be a good place to talk about NFS on ZFS, usage
of SSDs, and so on.

The second is supposed to be achievable, but the specifics...

These could go in the ZFS section in the Handbook.  I'm interested in
working on that.

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Mark Felder-4
In reply to this post by Warren Block
On Tue, 22 Jan 2013 09:04:42 -0600, Warren Block <[hidden email]>  
wrote:

> I'm a proponent of using various types of labels, but my impression  
> after a recent experience was that ZFS metadata was enough to identify  
> the drives even if they were moved around.  That is, ZFS bare metadata  
> on a drive with no other partitioning or labels.
>  Is that incorrect?

If you have an enclosure with 48 drives can you be confident which drive  
is failing using only the ZFS metadata?

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Borja Marcos-2
In reply to this post by Warren Block

On Jan 22, 2013, at 4:04 PM, Warren Block wrote:

> I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around.  That is, ZFS bare metadata on a drive with no other partitioning or labels.
>
> Is that incorrect?

I'm afraid it's inconvenient unless you enjoy reboots ;)

This is a pathological, yet quite likely, example I just demonstrated to a friend.

We were testing a new server with 12 hard disks and a proper HBA.

The disks are, unsurprisingly, da0-da11. There is a da12 used (for now) for the OS, so there's no problem creating and destroying pools at leisure.

My friend had created a pool with two raidz vdevs, nothing rocket science: da0-5 and da6-11.

So, we were doing some tests and I pulled one of the disks. Nothing special, ZFS recovers nicely.

Now comes the fun part.

I reboot the machine with the missing disk.

What happens now?

I had pulled da4, I think. So, disks with an ID > 4 have been renamed to N - 1: da5 became da4, da6 became da5, da7 became da6... and, critically, da12 became da11.

The reboot began by failing to mount the root filesystem, but that one is trivial. Just tell the kernel where it is now (da11) and it boots happily.

Now we have a degraded pool with a missing disk (da4) and a da4 that previously was da5. It works, of course, but in a degraded state.

OK, we found a replacement disk and plugged it in. It became, guess what? Yes, da12.

Now: I cannot "online" da4, because a da4 already exists. I cannot online da12, because it didn't belong to the pool. I cannot replace da4 with da12, because it is there.

Now that I think of it, in this case:
15896790606386444480  OFFLINE      0     0     0  was /dev/da4

Is it possible to say zpool replace 15896790606386444480 da12? I haven't tried it.
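
If it does work, the command would presumably look something like this ("tank" standing in for the actual pool name):

zpool replace tank 15896790606386444480 da12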

Anyway, it all seems a bit confusing. The logical, albeit cumbersome, approach is to reboot the machine with the new da4 in place and, after rebooting, online or replace it.

Using names prevents this kind of confusion.






Borja.


Re: RFC: Suggesting ZFS "best practices" in FreeBSD

claudiu vasadi
On Tue, Jan 22, 2013 at 4:36 PM, Borja Marcos <[hidden email]> wrote:

> […]
>
> Using names prevents this kind of confusion.
Same thing happened to me on a production server that crashed. Sometimes
it's not easy to reboot again simply because you need to insert a disk.
This, of course, makes the hot-swap capability of the hardware useless.

--
Best regards,
Claudiu Vasadi

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Kevin Day-3
In reply to this post by Warren Block

On Jan 22, 2013, at 9:12 AM, Warren Block <[hidden email]> wrote:
>
> I would like to see guidelines for at least two common scenarios:
>
> Multi-terabyte file server with multi-drive pool.

[…]

> The first is easy with the defaults, but particular tuning could be beneficial.  And would be a good place to talk about NFS on ZFS, usage of SSDs, and so on.


I run ftpmirror.your.org, which is a 72 x 3TB drive ZFS server. It's a very busy server. It currently houses the only off-site backup of all of the Wikimedia projects (121 TB), a full FreeBSD FTP mirror (1 TB), a full CentOS mirror, all of FreeBSD-Archive (1.5 TB), FreeBSD-CVS, etc. It's usually running between 100 and 1500 Mbps of Ethernet traffic in/out. There are usually around 15 FTP connections, 20-50 HTTP connections, 10 rsync connections and 1 or 2 CVS connections.

The only changes we've made that are ZFS-specific are atime=off and sync=disabled. Nothing we do uses atimes, so disabling that cuts down on a ton of unnecessary writes. Disabling sync is okay here too - we're just mirroring stuff that's available elsewhere, so there's no threat of data loss. Other than some TCP tuning in sysctl.conf, this is running a totally stock kernel with no special settings.
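
For reference, those two settings amount to just the following (with "tank" standing in for our actual pool name):

zfs set atime=off tank
zfs set sync=disabled tank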

I've looked at using an SSD for meta-data only caching, but it appears that we've got far more than 256GB of metadata here that's being accessed regularly (nearly every file is being stat'ed when rsync runs) so I'm guessing it's not going to be incredibly effective unless I buy a seriously large SSD.

If you have any specific questions I'm happy to answer though.

-- Kevin


Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Mark Felder-4
On Tue, 22 Jan 2013 09:40:24 -0600, Kevin Day <[hidden email]>  
wrote:

> I've looked at using an SSD for meta-data only caching, but it appears  
> that we've got far more than 256GB of metadata here that's being  
> accessed regularly (nearly every file is being stat'ed when rsync runs)  
> so I'm guessing it's not going to be incredibly effective unless I buy a  
> seriously large SSD.

Well, 512GB SSDs are less than $500 now (Samsung 830) but can't you just  
add multiple SSDs? AFAIK you can have multiple L2ARC devices and ZFS will  
just split the load between them.
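
Presumably something as simple as this would do it (pool and device names are just an example):

zpool add tank cache da20 da21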

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Freddie Cash-8
In reply to this post by Warren Block
On Jan 22, 2013 7:04 AM, "Warren Block" <[hidden email]> wrote:
>
> On Tue, 22 Jan 2013, Borja Marcos wrote:
>
>> […]
>
>
> I'm a proponent of using various types of labels, but my impression after
> a recent experience was that ZFS metadata was enough to identify the drives
> even if they were moved around.  That is, ZFS bare metadata on a drive with
> no other partitioning or labels.
>
> Is that incorrect?

The ZFS metadata on disk allows you to move disks around in a system and
still import the pool, correct.

But the ZFS metadata will not help you figure out which disk, in which bay,
of which drive shelf just died and needs to be replaced.

That's where glabels, gpt labels, and similar come in handy. It's for the
sysadmin, not the system itself.

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Michael DeMan
In reply to this post by Borja Marcos-2
I think this would be awesome.  Googling around, it is extremely difficult to know what to do and which practices are current or obsolete, etc.

I would suggest maybe some separate sections so the information is organized well and can be easily maintained?


MAIN:
- recommend that anybody using ZFS have a 64-bit processor and 8GB RAM.
- I don't know, but it seems to me that much of what would go in here is fairly well known now and probably not changing much?

ROOT ON ZFS:
- section just for this

32-bit AND/OR TINY MEMORY:
- all the tuning needed for the people who aren't following the recommended 64-bit + 8GB RAM setup.
- probably there are enough such people, even though it seems pretty obvious that in a couple more years nobody will have 32-bit machines or less than 8GB RAM?



A couple more things for subsections in topic MAIN - lots of stuff to go in there...


PARTITIONING:
I could be misinformed here, but my understanding is that best practice is to use gpart + gnop (a rough sketch follows this list) to:
#1.  Ensure proper alignment for 4K-sector drives - the latest Western Digitals still report 512-byte sectors.
#2.  Ensure a little extra space is left on the drive, since if the whole drive is used, a replacement may be a tiny bit smaller and will not work.
#3.  Label the disks so you know what is what.
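
A rough sketch of that recipe, as I understand it (disk devices, sizes and label names are only examples; the gnop step is only needed once, at pool creation time):

gpart create -s gpt ada0
gpart add -t freebsd-zfs -a 1m -s 930G -l myhostname-ada0p1 ada0
gnop create -S 4096 gpt/myhostname-ada0p1
# ...same for the second disk, then build the pool on the .nop devices:
zpool create zpmirror mirror gpt/myhostname-ada0p1.nop gpt/myhostname-ada1p1.nop
# once the pool exists with the right ashift, the gnop layer can be dropped:
zpool export zpmirror
gnop destroy gpt/myhostname-ada0p1.nop gpt/myhostname-ada1p1.nop
zpool import zpmirror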

MAPPING PHYSICAL DRIVES:
Particularly an issue with SATA drives - basically force the mapping so that if the system reboots with a drive missing (or you add drives) you know what is what.
- http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/011039.html
- so you can put a label on the disk caddies and when the system says 'diskXYZ' died - you can just look at the label on the front of the box and change 'diskXYZ'.
- also without this - if you reboot after adding disks or with a disk missing - all the adaXYZ numbering shifts :(


SPECIFIC TUNABLES
- there are still a myriad of specific tunables that can be very helpful even with 8GB+ of RAM

ZFS GENERAL BEST PRACTICES - address the regular ZFS stuff here (a small example follows this list)
- why the ZIL is a good thing even if you think it kills your NFS performance
- no vdevs > 8 disks, raidz1 best with 5 disks, raidz2 best with 6 disks, etc.
- striping over raidz1/raidz2 pools
- striping over mirrors
- etc...
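
For instance, striping over mirrors vs. striping over raidz1 vdevs looks like this at pool creation time (disk names are just an example):

zpool create tank mirror da0 da1 mirror da2 da3
zpool create tank raidz1 da0 da1 da2 da3 da4 raidz1 da5 da6 da7 da8 da9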











On Jan 22, 2013, at 3:03 AM, Borja Marcos <[hidden email]> wrote:

> […]

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Michael DeMan
In reply to this post by Warren Block

On Jan 22, 2013, at 7:04 AM, Warren Block <[hidden email]> wrote:

> On Tue, 22 Jan 2013, Borja Marcos wrote:
>
>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>
>> ZFS was born in a system with static device naming (Solaris). When you plug a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>>
>> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to the device number -1. Potential for trouble as a minimum.
>>
>> After several different installations, I am preferring to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition in each drive, and Iabel it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.
>
> I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around.  That is, ZFS bare metadata on a drive with no other partitioning or labels.
>
> Is that incorrect?

I don't know if it is correct or not, but the best I could figure out was to both label the drives and also force the mapping so the physical and logical drives always show up associated correctly.
I also ended up deciding I wanted the hostname as a prefix for the labels - so if they get moved around to say another machine I can look and know what is going on - 'oh yeah, those disks are from the ones we moved over to this machine'...

Again - no idea if this is right or 'best practice' but it was what I ended up doing since we don't have that 'best practice' document.


Basically what I came to was:

#1.  Map the physical drive slots to how they show up in FBSD so if a disk is removed and the machine is rebooted all the disks after that removed one do not have an 'off by one error'.  i.e. if you have ada0-ada14 and remove ada8 then reboot - normally FBSD skips that missing ada8 drive and the next drive (that used to be ada9) is now called ada8 and so on...  

#2.  Use gpart+gnop to deal with 4K disk sizes in a standardized way and also to leave a little extra room so if when doing a replacement disk and that disk is a few MB smaller than the original - it all 'just works'.  (All disks are partitioned to a slightly smaller size than their physical capacity).

#3.  For ZFS - make the pool run off the labels.  The labels include the 'adaXXX' physical disk name for easy reference.  If the disks are moved to another machine (say ada0-ada14 are moved to ada30-44 in a new box) then naturally that is off - but with the original hostname prefix in the label (presuming hostnames are unique) you can tell what is going on.  Having the disks in another host I treat as an emergency/temporary situation, and the pool can be taken offline and the labels fixed up if the plan is for the disks to live in that new machine for a long time.


Example below on a test box - so if these drives got moved to another machine where ada6 and ada14 are already present -

        NAME                        STATE     READ WRITE CKSUM
        zpmirrorTEST                ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            gpt/myhostname-ada6p1   ONLINE       0     0     0
            gpt/myhostname-ada14p1  ONLINE       0     0     0
        logs
          da1



Re: RFC: Suggesting ZFS "best practices" in FreeBSD - mapping logical to physical drives

Michael DeMan
In reply to this post by Freddie Cash-8
Hi,

We have been able to effectively mitigate this problem (and rigorously test the fix).
I myself am fussy, and when a disk drive dies I want to make sure that the data-center technician is removing/replacing exactly the correct disk
-AND- that if the machine reboots with a disk removed, or added, it all just looks normal.

I think this is basically another item for which there are standard ways to deal with it, but no documentation?

What we did was use /boot/device.hints.

On the machine we rigorously tested this on, we have the following in /boot/device.hints.  This is for the particular controllers as noted, but I think it works for any SATA or SAS controllers?


# OAIMFD 2011.04.13 adding this to force ordering on adaX disks
# dev.mvs.0.%desc: Marvell 88SX6081 SATA controller
# dev.mvs.1.%desc: Marvell 88SX6081 SATA controller

hint.scbus.0.at="mvsch0"
hint.ada.0.at="scbus0"

hint.scbus.1.at="mvsch1"
hint.ada.1.at="scbus1"

hint.scbus.2.at="mvsch2"
hint.ada.2.at="scbus2"

hint.scbus.3.at="mvsch3"
hint.ada.3.at="scbus3"

...and so on up to ada14...

Inserting disks into slots that were empty before and rebooting, or removing disks that did exist and rebooting - it all 'just works'.




On Jan 22, 2013, at 3:02 PM, Freddie Cash <[hidden email]> wrote:

> […]

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Warren Block
In reply to this post by Michael DeMan
On Tue, 22 Jan 2013, Michael DeMan wrote:

> On Jan 22, 2013, at 7:04 AM, Warren Block <[hidden email]> wrote:
>>
>> I'm a proponent of using various types of labels, but my impression
>> after a recent experience was that ZFS metadata was enough to
>> identify the drives even if they were moved around.  That is, ZFS
>> bare metadata on a drive with no other partitioning or labels.
>>
>> Is that incorrect?
>
> I don't know if it is correct or not, but the best I could figure out
> was to both label the drives and also force the mapping so the
> physical and logical drives always show up associated correctly. I
> also ended up deciding I wanted the hostname as a prefix for the
> labels - so if they get moved around to say another machine I can look
> and know what is going on - 'oh yeah, those disks are from the ones we
> moved over to this machine'...

It helps to avoid duplicate labels, a good idea.

> #1.  Map the physical drive slots to how they show up in FBSD so if a
> disk is removed and the machine is rebooted all the disks after that
> removed one do not have an 'off by one error'.  i.e. if you have
> ada0-ada14 and remove ada8 then reboot - normally FBSD skips that
> missing ada8 drive and the next drive (that used to be ada9) is now
> called ada8 and so on...

How do you do that?  If I'm in that situation, I think I could find the
bad drive, or at least the good ones, with diskinfo and the drive serial
number.  One suggestion I saw somewhere was to use disk serial numbers
for label values.

> #2.  Use gpart+gnop to deal with 4K disk sizes in a standardized way
> and also to leave a little extra room so if when doing a replacement
> disk and that disk is a few MB smaller than the original - it all
> 'just works'.  (All disks are partitioned to a slightly smaller size
> than their physical capacity).

I've been told (but have not personally verified) that newer versions of
ZFS actually leave some unused space at the end of a drive to allow for
variations in nominally-sized drives.  Don't know how much.

Re: RFC: Suggesting ZFS "best practices" in FreeBSD

jas
On 22/01/2013 9:16 PM, Warren Block wrote:

> On Tue, 22 Jan 2013, Michael DeMan wrote:
>
>> On Jan 22, 2013, at 7:04 AM, Warren Block <[hidden email]> wrote:
>>>
>>> I'm a proponent of using various types of labels, but my impression
>>> after a recent experience was that ZFS metadata was enough to
>>> identify the drives even if they were moved around. That is, ZFS
>>> bare metadata on a drive with no other partitioning or labels.
>>>
>>> Is that incorrect?
>>
>> I don't know if it is correct or not, but the best I could figure out
>> was to both label the drives and also force the mapping so the
>> physical and logical drives always show up associated correctly. I
>> also ended up deciding I wanted the hostname as a prefix for the
>> labels - so if they get moved around to say another machine I can
>> look and know what is going on - 'oh yeah, those disks are from the
>> ones we moved over to this machine'...
>
> It helps to avoid duplicate labels, a good idea.
>
>> #1.  Map the physical drive slots to how they show up in FBSD so if a
>> disk is removed and the machine is rebooted all the disks after that
>> removed one do not have an 'off by one error'.  i.e. if you have
>> ada0-ada14 and remove ada8 then reboot - normally FBSD skips that
>> missing ada8 drive and the next drive (that used to be ada9) is now
>> called ada8 and so on...
>
> How do you do that?  If I'm in that situation, I think I could find
> the bad drive, or at least the good ones, with diskinfo and the drive
> serial number.  One suggestion I saw somewhere was to use disk serial
> numbers for label values.
I think that was done using /boot/device.hints.  Unfortunately it only
works for some systems, not all, and someone shared an experience with me
where a kernel update changed the card probe order, the device names
changed, and it all broke.  It worked for one card but not the other, so I
gave up, because I wanted consistency across different systems.

In my own opinion, the whole process of partitioning drives, labelling
them, all the tricks for dealing with 4K drives, manually configuring
/boot/device.hints, and so on is something that we have to do, but
honestly, I really believe there *has* to be a better way.

Years back, when I was using a 3ware/AMCC RAID card (actually, I AM still
using a few), none of this was an issue: every disk just appeared in
order, I didn't have to configure anything specially, the ordering never
changed when I removed a drive, and I didn't need to partition or do
anything else with the disks - just give the card the raw disks and it
knew what to do.  If anything, I took my labeller and labelled the disk
bays with numeric labels, so when I got an error I knew which disk to
pull - but the order never changed, and I always pulled the right drive.
Now I look at my pricey "new" system and see the disks ordered by default
in what seems like an almost "random" order.  I dd'ed each drive to figure
out the exact ordering and labelled the disks, but it just gets really
annoying.
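
A rough sketch of that dd trick, with a placeholder device name; it works
because only the drive being read keeps its activity LED busy:

    # dd if=/dev/ada8 of=/dev/null bs=1m count=10000
      (watch which bay's LED stays lit, then label that bay as ada8 / its GPT label)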

>
>> #2.  Use gpart+gnop to deal with 4K disk sizes in a standardized way
>> and also to leave a little extra room so if when doing a replacement
>> disk and that disk is a few MB smaller than the original - it all
>> 'just works'.  (All disks are partitioned to a slightly smaller size
>> than their physical capacity).
>
> I've been told (but have not personally verified) that newer versions
> of ZFS actually leave some unused space at the end of a drive to allow
> for variations between drives of nominally the same size.  I don't know
> how much.

You see... this point has been mentioned on the list a whole bunch of
times, and it is exactly the type of information that needs to make it
into a "best practices" document.  Does anyone know whether this applies
to ZFS on FreeBSD?  From what version?  Who knows, maybe a whole bunch of
people are partitioning devices that don't need to be partitioned!  :)

Jason.


--
Jason Keltz
Manager of Development
Department of Computer Science and Engineering
York University, Toronto, Canada
Tel: 416-736-2100 x. 33570
Fax: 416-736-5872


Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Michael DeMan
Inline below...
On Jan 22, 2013, at 6:40 PM, Jason Keltz <[hidden email]> wrote:
<SNIP>
>>> #1.  Map the physical drive slots to how they show up in FBSD so if a disk is removed and the machine is rebooted all the disks after that removed one do not have an 'off by one error'.  i.e. if you have ada0-ada14 and remove ada8 then reboot - normally FBSD skips that missing ada8 drive and the next drive (that used to be ada9) is now called ada8 and so on...
>>
>> How do you do that?  If I'm in that situation, I think I could find the bad drive, or at least the good ones, with diskinfo and the drive serial number.  One suggestion I saw somewhere was to use disk serial numbers for label values.
> I think that was done using /boot/device.hints.  Unfortunately it only works for some systems, not all, and someone shared an experience with me where a kernel update changed the card probe order, the device names changed, and it all broke.  It worked for one card but not the other, so I gave up, because I wanted consistency across different systems.

I am not sure, but I possibly hit that same issue with PCI probing on our ZFS test machine - basically, I vaguely recall asking to have the SATA controllers' slots swapped, without completely understanding why it needed to be done, other than that it did.  It could have been from an upgrade from FBSD 7.x -> 8.x -> 9.x, or it could just have been because it is a test box, other things were going on with it for a while, and the cards got put back in out of order after some other work.

This is actually kind of an interesting problem overall - logical vs. physical, and how to keep things mapped in a way that makes sense.  The Linux community has run into this and substantially changed (from a basic end-user perspective) the way they deal with hardware MAC addresses and Ethernet cards between RHEL5 and RHEL6.  Ultimately, neither of their techniques works very well.  The FreeBSD community should probably pick one strategy or the other, standardize on it warts and all, and have it documented.

>
> In my own opinion, the whole process of partitioning drives, labelling them, all the tricks for dealing with 4K drives, manually configuring /boot/device.hints, and so on is something that we have to do, but honestly, I really believe there *has* to be a better way.

I agree.  At this point the only solution I can think of for using ZFS on FreeBSD in production is to write scripts that do all of this - all the goofy gpart + gnop + everything else.  How is anybody supposed to replace a disk in an emergency if they have to run a bunch of cryptic command-line steps on the new disk before they can even confidently put it in as a replacement for the original?  And, almost by definition, if you have to do a bunch of manual command-line work, you cannot be reliably confident.
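
For reference, the "goofy gpart + gnop" sequence being referred to is
roughly the following; the pool name, labels, device names, and partition
size are all placeholders, and the size is deliberately a bit smaller than
the raw disk to leave slack for replacement drives:

    # gpart create -s gpt da5
    # gpart add -t freebsd-zfs -a 4k -s 1862G -l tank-raidz1-disk5 da5
      (repeat for each member disk)
    # gnop create -S 4096 /dev/gpt/tank-raidz1-disk5
      (a .nop device with a fake 4K sector size; commonly one .nop member
       per vdev is enough to get ashift=12)
    # zpool create tank raidz1 gpt/tank-raidz1-disk1 gpt/tank-raidz1-disk2 \
        gpt/tank-raidz1-disk3 gpt/tank-raidz1-disk4 gpt/tank-raidz1-disk5.nop
    # zpool export tank
    # gnop destroy /dev/gpt/tank-raidz1-disk5.nop
    # zpool import -d /dev/gpt tank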

> Years back, when I was using a 3ware/AMCC RAID card (actually, I AM still using a few), none of this was an issue: every disk just appeared in order, I didn't have to configure anything specially, the ordering never changed when I removed a drive, and I didn't need to partition or do anything else with the disks - just give the card the raw disks and it knew what to do.  If anything, I took my labeller and labelled the disk bays with numeric labels, so when I got an error I knew which disk to pull - but the order never changed, and I always pulled the right drive.  Now I look at my pricey "new" system and see the disks ordered by default in what seems like an almost "random" order.  I dd'ed each drive to figure out the exact ordering and labelled the disks, but it just gets really annoying.


A lot of these things - such as making sure a little extra space is spared on the drive when an array is first built, so that a replacement drive with slightly smaller capacity still fits - the RAID vendors have hidden away from the end user.  In many cases they have only done that in the last 10 years or so.  And I stumbled a few weeks ago across a Sun ZFS user who had received Sun-certified disks with the same issue - a few sectors too small...


Overall, you are describing exactly the kind of behavior that I want, and that I think everybody needs, from a FreeBSD+ZFS system.

- Alarm sent out - drive #52 failed - wake up and deal with it.
- Go to the server (or call the data center) - groggily look at the labels on the front of the disk caddies - physically pull drive #52.
- Insert a new, similarly sized drive from inventory as the new #52.
- Verify the resilver is in progress.
- Confidently go back to bed knowing all is okay.

The above scenario is just unworkable right now for most people (even tech-savvy people) because of the lack of documentation - hence I am glad to see some kind of 'best practices' document put together.
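
For illustration, with the labelling scheme discussed earlier in the
thread, the manual steps behind that scenario might boil down to something
like this (pool name, label, bay number, device name, and partition size
are all placeholders):

    # zpool status tank                     (identify the failed member, e.g. gpt/tank-disk52)
    # zpool offline tank gpt/tank-disk52    (if it is not already FAULTED/REMOVED)
      ... physically swap the drive in bay #52; the new disk shows up as, say, da51 ...
    # gpart create -s gpt da51
    # gpart add -t freebsd-zfs -a 4k -s 1862G -l tank-disk52 da51
    # zpool replace tank gpt/tank-disk52    (resilvering starts onto the new partition)
    # zpool status tank                     (confirm the resilver is in progress)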
<SNIP>

- Mike




Re: RFC: Suggesting ZFS "best practices" in FreeBSD

David Xu-2
In reply to this post by Scott Long-2
On 2013/01/22 22:33, Scott Long wrote:

>
> On Jan 22, 2013, at 4:03 AM, Borja Marcos <[hidden email]> wrote:
>
>> <SNIP>
>>
>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>
>> ZFS was born in a system with static device naming (Solaris). When you plug a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>>
>
> Look up SCSI device wiring in /sys/conf/NOTES.  That's one solution to static naming, just with a slightly different angle than Solaris.  I do agree with your general thesis here, and either wiring should be made a much more visible and documented feature, or a new mechanism should be developed to provide naming stability.  Please let me know what you think of the wiring mechanic.
>>
>>

I am curious: since we already have devfs, why doesn't the driver
create device entries like the following?

/dev/scsi/bus0/target0/lun0/ada0
/dev/scsi/bus0/target0/lun0/ada0s1
/dev/scsi/bus0/target0/lun0/ada0s2
...

This would eliminate the need for hints.
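
For context, the hint-based wiring Scott mentions looks roughly like this
in /boot/device.hints; the mps0 controller name and the bus/target numbers
are only an example, adapted from the sample entries in /sys/conf/NOTES:

    # wire SCSI bus 0 to the first mps(4) HBA, and pin da0/da1 to targets on it
    hint.scbus.0.at="mps0"
    hint.da.0.at="scbus0"
    hint.da.0.target="0"
    hint.da.0.unit="0"
    hint.da.1.at="scbus0"
    hint.da.1.target="1"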


>> 2- RAID cards.
>>
>> Simply: Avoid them like the pest. ZFS is designed to operate on bare disks. And it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards.
>>
>
> Agree 200%.  Despite the best effort of sales and marketing people, RAID cards do not make good HBAs.  At best they add latency.  At worst, they add a lot of latency and extra failure modes.
>
> Scott
>


Re: RFC: Suggesting ZFS "best practices" in FreeBSD

Harold Paulson-3
In reply to this post by jas
I'm not allowed to edit this page:

  https://wiki.freebsd.org/action/edit/ZFSBestPractices?action=edit

but someone should start pouring this info into the wiki.  It's gold.

        - H

