bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Victor Sudakov-3
Dear Colleagues,

I have a FreeBSD 11.2 system with 32G RAM which I'm going to use as a
bhyve host with zvols and sparse zvols, with (almost) no
services/daemons of its own.

Could you please clarify some points for me?

1. Does ARC actually cache zfs volumes (not files/datasets)?

2. If ARC does cache volumes, does this cache make sense on a hypervisor,
given that guest OSes will probably have their own disk caches anyway?

3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less of
total RAM, so that most RAM is available to guest machines?

4. What other zfs tuning measures can you suggest for a bhyve
hypervisor?


--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Patrick M. Hausen
Hi!

> On 19.03.2019 at 03:46, Victor Sudakov <[hidden email]> wrote:
> 1. Does ARC actually cache zfs volumes (not files/datasets)?

Yes it does.

> 2. If ARC does cache volumes, does this cache make sense on a hypervisor,
> because guest OSes will probably have their own disk cache anyway.

IMHO not much, because the guest OS is relying on the fact that when
it writes its own cached data out to "disk", it will be committed to
stable storage.

> 3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less of
> total RAM, so that most RAM is available to guest machines?

Yes, if you build your own solution on plain FreeBSD. No, if you are running
FreeNAS, which already tries to autotune the ARC size according to the
memory committed to VMs.
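
For illustration, on a plain FreeBSD host with 32 GB of RAM, capping the ARC
at 4 GiB (1/8 of RAM) could look like this in /boot/loader.conf - a sketch
only, the figure is just an example and the value is given in bytes:

        # /boot/loader.conf
        vfs.zfs.arc_max="4294967296"   # 4 GiB = 4 * 1024^3 bytes

On recent FreeBSD releases the same value can usually also be changed at
runtime with "sysctl vfs.zfs.arc_max=...", which makes it easy to experiment
before committing the setting to loader.conf.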

> 4. What other zfs tuning measures can you suggest for a bhyve
> hypervisor?

e.g.
        zfs set sync=always zfs/vm

if zfs/vm is the dataset under which you create the ZVOLs for your emulated
disks.
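
A minimal sketch of that layout - pool and dataset names are examples only:

        zfs create zfs/vm
        zfs set sync=always zfs/vm
        # sparse (-s) 20 GB volume for one guest; it inherits sync=always from zfs/vm
        zfs create -s -V 20G zfs/vm/mail-disk0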

I’m using this for all my VM "disks" and have added a 16 GB SLOG device
to my spinning disk pool - seems to work great. This is on a home system.
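
Adding a separate log device to an existing pool generally looks like the
following - pool and partition names are placeholders:

        zpool add tank log nvd0p2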

Our new data centre systems feature all NVMe SSDs and no spinning rust.
So no need for a separate SLOG.

HTH,
Patrick
--
punkt.de GmbH Internet - Dienstleistungen - Beratung
Kaiserallee 13a Tel.: 0721 9109-0 Fax: -100
76133 Karlsruhe [hidden email] http://punkt.de
AG Mannheim 108285 Gf: Juergen Egeling

Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Victor Sudakov-3
Patrick M. Hausen wrote:

>
> > 1. Does ARC actually cache zfs volumes (not files/datasets)?
>
> Yes it does.
>
> > 2. If ARC does cache volumes, does this cache make sense on a hypervisor,
> > because guest OSes will probably have their own disk cache anyway.
>
> IMHO not much, because the guest OS is relying on the fact that when
> it writes it’s own cached data out to „disk“, it will be committed to
> stable storage.
This is an important point.

> > 3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less of
> > total RAM, so that most RAM is available to guest machines?
>
> Yes if you build your own solution on plain FreeBSD. No if you are running
> FreeNAS which already tries to autotune the ARC size according to the
> memory committed to VMs.
>
> > 4. What other zfs tuning measures can you suggest for a bhyve
> > hypervisor?
>
> e.g.
> zfs set sync=always zfs/vm
>
> if zfs/vm is the dataset under which you create the ZVOLs for your emulated
> disks.
Well, bhyve already has an option for this:

The block-device-options are:

nocache   Open the file with O_DIRECT.
direct    Open the file using O_SYNC.
ro        Force the file to be opened read-only.

I think something like
"-s 4:0,virtio-blk,/dev/zvol/zroot/vm/mail/disk0,direct"
would do the same?
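
For context, the full command line for such a guest might look roughly like
the sketch below - the slot numbers, tap interface and zvol path are just
examples, and the usual bhyveload/UEFI boot step is omitted:

        bhyve -c 2 -m 4G -A -H -P \
          -s 0:0,hostbridge \
          -s 4:0,virtio-blk,/dev/zvol/zroot/vm/mail/disk0,direct \
          -s 5:0,virtio-net,tap0 \
          -s 31,lpc -l com1,stdio \
          mail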

>
> I’m using this for all my VM „disks“ and have added a 16 GB SLOG device
> to my spinning disk pool - seems to work great. This is on a home system.

Is SLOG also used by zfs volumes?

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

RE: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Matt Churchyard (via freebsd-virtualization mailing list)
Patrick M. Hausen wrote:
>
> > 1. Does ARC actually cache zfs volumes (not files/datasets)?
>
> Yes it does.

I find this distinction between volumes/files/etc. and what is cached causes confusion (as does "volumes not datasets").

Both ZVOLs and ZFS file systems are types of dataset. A dataset stores data in records (usually up to 128 KB in size).
It's these records that are cached (and that most ZFS functions, such as compression/raidz/ZIL/etc., work with).

As far as the ZFS lower levels are concerned, there is no difference between a volume and a file system.

>
> > 2. If ARC does cache volumes, does this cache make sense on a
> > hypervisor, because guest OSes will probably have their own disk cache anyway.
>
> IMHO not much, because the guest OS is relying on the fact that when
> it writes it’s own cached data out to „disk“, it will be committed to
> stable storage.

Maybe I've missed something but I don't quite get the link between read cache (ARC) and guest writes here?

> This is an important point.

> > 3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less
> > of total RAM, so that most RAM is available to guest machines?
>
> Yes if you build your own solution on plain FreeBSD. No if you are
> running FreeNAS which already tries to autotune the ARC size according
> to the memory committed to VMs.
>
> > 4. What other zfs tuning measures can you suggest for a bhyve
> > hypervisor?
>
> e.g.
> zfs set sync=always zfs/vm
>
> if zfs/vm is the dataset under which you create the ZVOLs for your
> emulated disks.

> Well, bhyve already has an option for this:
>
> The block-device-options are:
>
> nocache   Open the file with O_DIRECT.
> direct    Open the file using O_SYNC.
> ro        Force the file to be opened read-only.
>
> I think something like
> "-s 4:0,virtio-blk,/dev/zvol/zroot/vm/mail/disk0,direct"
> would do the same?

>
> I’m using this for all my VM „disks“ and have added a 16 GB SLOG
> device to my spinning disk pool - seems to work great. This is on a home system.

> Is SLOG also used by zfs volumes?

As above, the core of ZFS doesn't care what type of dataset it is working with. ARC/ZIL/etc all work exactly the same.


Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Patrick M. Hausen
In reply to this post by Victor Sudakov-3
Hi!

> On 20.03.2019 at 02:52, Victor Sudakov <[hidden email]> wrote:
> Is SLOG also used by zfs volumes?

Yes, but for synchronous writes only, if I’m not mistaken.
So fundamentally yes, but in most cases no.

Patrick
--
punkt.de GmbH Internet - Dienstleistungen - Beratung
Kaiserallee 13a Tel.: 0721 9109-0 Fax: -100
76133 Karlsruhe [hidden email] http://punkt.de
AG Mannheim 108285 Gf: Juergen Egeling


Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Patrick M. Hausen
In reply to this post by Matt Churchyard
Hi Matt,

> On 20.03.2019 at 10:34, Matt Churchyard <[hidden email]> wrote:
>>> 2. If ARC does cache volumes, does this cache make sense on a
>>> hypervisor, because guest OSes will probably have their own disk cache anyway.
>>
>> IMHO not much, because the guest OS is relying on the fact that when
>> it writes it’s own cached data out to „disk“, it will be committed to
>> stable storage.
>
> Maybe I've missed something but I don't quite get the link between read cache (ARC) and guest writes here?

You are correct - I confused ARC and ZIL. I still recommend setting
sync=always for hypervisor "disk" ZVOLs.

Kind regards
Patrick
--
punkt.de GmbH Internet - Dienstleistungen - Beratung
Kaiserallee 13a Tel.: 0721 9109-0 Fax: -100
76133 Karlsruhe [hidden email] http://punkt.de
AG Mannheim 108285 Gf: Juergen Egeling


Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Mike Gerdts
In reply to this post by Patrick M. Hausen
On Tue, Mar 19, 2019 at 3:07 AM Patrick M. Hausen <[hidden email]> wrote:

> Hi!
>
> > On 19.03.2019 at 03:46, Victor Sudakov <[hidden email]> wrote:
> > 1. Does ARC actually cache zfs volumes (not files/datasets)?
>
> Yes it does.
>
> > 2. If ARC does cache volumes, does this cache make sense on a hypervisor,
> > because guest OSes will probably have their own disk cache anyway.
>
> IMHO not much, because the guest OS is relying on the fact that when
> it writes it’s own cached data out to „disk“, it will be committed to
> stable storage.
>

I'd recommend caching at least metadata (primarycache=metadata).  The guest
will not cache ZFS metadata, and not having metadata in the cache can lead
to a big hit in performance.  The metadata in question here are things like
block pointers that keep track of where the data is - ZFS can't find the
data without metadata.

I think the key decision as to whether you use primarycache=metadata or
primarycache=all comes down to whether you are after predictable
performance or optimal performance.  You will likely get worse performance
with primarycache=metadata (or especially with primarycache=none),
presuming the host has RAM to spare.  As you pack the system with more VMs
or allocate more disk to existing VMs, you will probably find that
primarycache=metadata leads to steadier performance regardless of how much
storage is used or how active other VMs are.
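
Applied to a layout like the zfs/vm example earlier in the thread, that would
be something along the lines of (the dataset name is just that example):

        # cache only ZFS metadata for the datasets holding VM disks;
        # data blocks are then cached by the guests alone
        zfs set primarycache=metadata zfs/vm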


> > 3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less of
> > total RAM, so that most RAM is available to guest machines?
>
> Yes if you build your own solution on plain FreeBSD. No if you are running
> FreeNAS which already tries to autotune the ARC size according to the
> memory committed to VMs.
>
> > 4. What other zfs tuning measures can you suggest for a bhyve
> > hypervisor?
>
> e.g.
>         zfs set sync=always zfs/vm
>
> if zfs/vm is the dataset under which you create the ZVOLs for your emulated
> disks.
>

I'm not sure what the state of this is in FreeBSD, but in SmartOS we allow
the guests to benefit from write caching if they negotiate FLUSH.  Guests
that do negotiate flush are expected to use proper barriers to flush the
cache at critical times.  When a FLUSH arrives, SmartOS bhyve issues an
fsync().  To be clear - SmartOS bhyve is not actually caching writes in
memory, it is just delaying transaction group commits.  This avoids
significant write inflation and associated latency.  Support for FLUSH
negotiation has greatly improved I/O performance - to the point that some
tests show parity with running directly on the host pool.  If not already
in FreeBSD, this would probably be something of relatively high value to
pull in.

If you do go the route of sync=always and primarycache=<none|metadata>, be
sure your guest block size and host volblocksize match.  ZFS (on platforms
I'm more familiar with, at least) defaults to volblocksize=8k.  Most guest
file systems these days seem to default to a block size of 4 KiB.  If the
guest file system issues a 4 KiB aligned write, that will turn into a
read-modify-write cycle to stitch that 4 KiB guest block into the host's 8
KiB block. If the adjacent guest block in the same 8 KiB host block
is written in the next write, it will also turn into a read-modify-write
cycle.
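
A hypothetical example of matching the two, assuming guests whose file
systems use 4 KiB blocks (volblocksize can only be set when the volume is
created; the dataset name is an example):

        zfs create -s -V 40G -o volblocksize=4k zfs/vm/guest1-disk0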

If you are using ZFS in the guest, this can be particularly problematic
because the guest ZFS will align writes with the guest pool's ashift, not
with a guest dataset's recordsize or volblocksize.  I discovered this during
extended benchmarking of ZFS-on-ZFS - primarily with primarycache=metadata
and sync=always.  The write inflation was quite significant: 3x was
common.  I tracked some of this down to alignment issues, and part of it was
due to sync writes causing the data to be written twice.
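
One way to reduce that particular mismatch, assuming a FreeBSD guest whose
virtio disk shows up as vtbd1 and a host volblocksize of 8k, is to pin the
guest pool's ashift before creating it - a sketch only:

        # inside the guest, before zpool create: 2^13 = 8 KiB
        sysctl vfs.zfs.min_auto_ashift=13
        zpool create data vtbd1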

George Wilson has a great talk where he describes the same issues I hit.

https://www.youtube.com/watch?v=_-QAnKtIbGc

I've mentioned write inflation related to sync writes a few times.  One
point that I think is poorly understood is that when ZFS is rushed into
committing a write with fsync or similar, the immediate write is of ZIL
blocks to the intent log.  The intent log can be on a separate disk (slog,
log=<device>) or it can be on the disks that hold the pool's data.  When
the intent log is on the data disks, the data is written to the same disks
multiple times: once as ZIL blocks and once as data blocks.  Between these
writes there will be full-disk head movement as the uberblocks are updated
at the beginning and end of the disk.
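
One rough way to watch this on the host while a guest runs a sync-heavy
workload (the pool name is an example):

        # without a separate log device, ZIL blocks and data blocks
        # both land on the same data vdevs
        zpool iostat -v tank 5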

What I say above is based on experience with kernel zones on Solaris and
bhyve on SmartOS.  There are enough similarities that I expect bhyve on
FreeBSD will be the same, but FreeBSD may have some strange-to-me zfs
caching changes.

Regards,
Mike

Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Victor Sudakov-3
In reply to this post by Matt Churchyard
Matt Churchyard wrote:

> >
> > > 1. Does ARC actually cache zfs volumes (not files/datasets)?
> >
> > Yes it does.
>
> I find this distinction between volumes/files/etc and what is cached
> causes confusion (as well as "volumes not datasets").
>
> Both ZVOLs and Z file systems are types of dataset. A dataset stores data
> in records (usually up to 128kb in size).  It's these records that are
> cached (and that most ZFS functions such as compression/raidz/zil/etc
> work with)
>
> As far as the ZFS lower levels are concerned, there is no difference
> between a volume and a file system.
Thank you Matt, this was very instructive.

> > > 2. If ARC does cache volumes, does this cache make sense on a
> > > hypervisor, because guest OSes will probably have their own disk cache anyway.
> >
> > IMHO not much, because the guest OS is relying on the fact that when
> > it writes it’s own cached data out to „disk“, it will be committed to
> > stable storage.
>
> Maybe I've missed something but I don't quite get the link between
> read cache (ARC) and guest writes here?

Maybe there was a confusion between read and write caches, but my
question still stands:

Does it make sense to cache the same data (for reading too) twice: once
in the host's RAM (ZFS ARC) and again in the guest's RAM (whatever fs the
guest uses, all modern OSes have disk caches)?

What do VMware or VirtualBox do in this situation? Do they ever cache
their volumes in the hypervisor's RAM?

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/


RE: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Matt Churchyard (via freebsd-virtualization mailing list)
> >
> > > 1. Does ARC actually cache zfs volumes (not files/datasets)?
> >
> > Yes it does.
>
> I find this distinction between volumes/files/etc and what is cached
> causes confusion (as well as "volumes not datasets").
>
> Both ZVOLs and Z file systems are types of dataset. A dataset stores
> data in records (usually up to 128kb in size).  It's these records
> that are cached (and that most ZFS functions such as
> compression/raidz/zil/etc work with)
>
> As far as the ZFS lower levels are concerned, there is no difference
> between a volume and a file system.

> Thank you Matt, this was very instructive.

> > > 2. If ARC does cache volumes, does this cache make sense on a
> > > hypervisor, because guest OSes will probably have their own disk cache anyway.
> >
> > IMHO not much, because the guest OS is relying on the fact that when
> > it writes it’s own cached data out to „disk“, it will be committed
> > to stable storage.
>
> Maybe I've missed something but I don't quite get the link between
> read cache (ARC) and guest writes here?

> Maybe there was a confusion between read and write caches, but my question still stands:
>
> Does it make sense to cache the same data (for reading too) twice: once in the host's RAM (ZFS ARC) and again in the guest's RAM (whatever fs the guest uses, all modern OSes have disk caches)?
>
> What do VMware or VirtualBox do in this situation? Do they ever cache their volumes in the hypervisor's RAM?

VirtualBox would be no different from bhyve in that it doesn't care what storage system you are using or how it is configured; that's up to the system administrator.
I believe VMFS is more akin to other "traditional" file systems and doesn't do RAM caching to anywhere near the extent that ZFS does. I do think you can use SSD/NVMe/etc. as cache in VMware.

My initial instinct would be to keep the cache on but reduce the limit, so that the majority of the RAM is available for guests. (I'd still want at least 4 GB as an absolute minimum, probably more on systems with 100 GB+ of total RAM.) Of course, you could test with primarycache set to all/metadata and see what effect it has. Adding L2ARC may be useful if the main pool is spinning disks, though I've heard there's a rule of thumb requiring X amount of ARC for Y amount of L2ARC; I'm not sure what that rule is.
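
For what it's worth, the current limits and actual ARC usage on a stock
FreeBSD host can be checked with something like:

        sysctl vfs.zfs.arc_max vfs.zfs.arc_min
        sysctl kstat.zfs.misc.arcstats.size   # current ARC size in bytes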

I'd also be intrigued to know what the logic in FreeNAS is for it. Is it simply a case of "arc = total_ram - guest_allocated"?
Is there a lower limit based on a percentage of total RAM, and/or a hard lower limit?

Matt


Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

Patrick M. Hausen
Hi all,

> On 21.03.2019 at 11:24, Matt Churchyard via freebsd-virtualization <[hidden email]> wrote:
> I'd also be intrigued to know what the logic in FreeNAS is for it. It is simply a case of "(arc = total_ram - guest_allocated)"?
> Is there a lower limit based on a percentage or total RAM, and/or a hard lower limit?

The relevant code can be found here:
https://github.com/freenas/freenas/blob/master/src/middlewared/middlewared/plugins/vm.py

Kind regards,
Patrick
--
punkt.de GmbH Internet - Dienstleistungen - Beratung
Kaiserallee 13a Tel.: 0721 9109-0 Fax: -100
76133 Karlsruhe [hidden email] http://punkt.de
AG Mannheim 108285 Gf: Juergen Egeling
