Observations from a ZFS reorganization on 12-STABLE

Karl Denninger
I've long argued that the VM system's interaction with the ZFS ARC
and UMA has serious, even severe issues.  12.x appeared to have
addressed some of them, and as such I've yet to roll forward any part of
the patch series that is found here [
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594 ] or the
Phabricator version referenced in the bug thread (which is more complex
and attempts to dig at the root of the issue more effectively,
particularly when UMA is involved, as it usually is).

Yesterday I decided to perform a fairly significant reorganization of
the ZFS pools on one of my personal machines, including the root pool,
which was on mirrored SSDs, changing it to RAIDZ2 (also on SSDs).  This
of course required booting single-user from a 12-STABLE memstick.

A simple "zfs send -R zs/root-save/R | zfs recv -Fuev zsr/R" should have
done it, no sweat.  The root that was copied over before I started is
uncomplicated; it's compressed, but not de-duped.  While it has
snapshots on it too it's by no means complex.

*The system failed to execute that command with an "out of swap space"
error, killing the job; there was indeed no swap configured since I
booted from a memstick.*

Huh?  A simple *filesystem copy* managed to force a 16GB system into
requiring page file backing store?

I was able to complete the copy by temporarily adding the swap space
back on (where it would be when the move was complete) but that
requirement is pure insanity and it appears, from what I was able to
determine, that it came about from the same root cause that's been
plaguing VM/ZFS interaction since 2014, when I started work on this issue --
specifically, when RAM gets low, rather than evict ARC (or clean up UMA
that is allocated but unused) the system will attempt to page out
working set.  In this case there was nowhere to page the working set out
to, so the process involved got an OOM error and was terminated.

*I continue to argue that this decision is ALWAYS wrong.*

It's wrong because if you invalidate cache and reclaim it you *might*
take a read from physical I/O to replace that data back into the cache
in the future (since it's not in RAM) but in exchange for a *potential*
I/O you perform a GUARANTEED physical I/O (to page out some amount of
working set) and possibly TWO physical I/Os (to page said working set
out and, later, page it back in.)
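
The comparison above can be put into numbers with a minimal sketch. It assumes each eviction or page-out costs a single physical I/O and that the cached data being dropped is clean (so dropping it needs no write); the function names and the probabilities are illustrative only and are not taken from the kernel.

```python
def expected_ios_evict_cache(p_miss: float) -> float:
    """Expected physical I/Os when clean cached data is dropped.

    The only cost is one read, and only if the data is wanted again later.
    """
    return p_miss


def expected_ios_page_out(p_refault: float) -> float:
    """Expected physical I/Os when working set is paged out.

    One write is certain; a read follows if the page is touched again.
    """
    return 1.0 + p_refault


# Sweep a few illustrative probabilities (assumed values, not measurements).
for p in (0.0, 0.5, 1.0):
    evict = expected_ios_evict_cache(p)
    page = expected_ios_page_out(p)
    print(f"p={p}: evict cache -> {evict} I/Os, page out -> {page} I/Os")
```

Under these assumptions the worst case for eviction (a certain future miss) costs no more than the best case for paging (the page is never touched again).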

It has always appeared to me to be flat-out nonsensical to trade a
possible physical I/O (if there is a future cache miss) for a guaranteed
physical I/O and a possible second one.  It's even worse if the reason
you make that decision is that UMA is allocated but unused; in that case
you are paging when no physical I/O is required at all as the "memory
pressure" is a phantom!  While UMA is a very material performance win in
the general case, allowing allocated-but-unused UMA to force paging
appears, from a performance perspective, to be flat-out insanity.  I find it
very difficult to come up with any reasonable scenario where releasing
allocated-but-unused UMA rather than paging out working set is a net
performance loser.

In this case since the system was running in single user mode the
process that got selected to be destroyed when that circumstance arose
and there was no available swap was the copy process itself.  The copy
itself did not require anywhere near all of the available non-kernel RAM.

I'm going to dig into this further but IMHO the base issue still exists,
even though the impact of it for my workloads with everything "running
normally" has materially decreased with 12.x.

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: Observations from a ZFS reorganization on 12-STABLE

Rainer Duffner


> On 17.03.2019 at 15:58, Karl Denninger <[hidden email]> wrote:
>
> I've long argued that the VM system's interaction with ZFS' arc cache
> and UMA has serious, even severe issues.  12.x appeared to have
> addressed some of them, and as such I've yet to roll forward any part of
> the patch series that is found here [
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594 ] or the
> Phabricator version referenced in the bug thread (which is more-complex
> and attempts to dig at the root of the issue more effectively,
> particularly when UMA is involved as it usually is.)
>
> Yesterday I decided to perform a fairly significant reorganization of
> the ZFS pools on one of my personal machines, including the root pool
> which was on mirrored SSDs, changing to a Raidz2 (also on SSDs.)  This
> of course required booting single-user from a 12-Stable memstick.



Interesting.

The patches published before Christmas 2018 solved all of the problems I had (shared by many others, and probably also visible on the FreeBSD project's own infrastructure) with 11.2 and 12.0.

I run a decently sized syslog server, and the 25MB/s stream of syslog data was killing 11.2 almost instantly.

I have a few 11.2 systems that I haven't patched yet, but they have north of 128GB of RAM and ARC had been configured down to 70% long before that, so I never saw the issue there.


_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[hidden email]"

Re: Observations from a ZFS reorganization on 12-STABLE

Eugene Grosbein-10
In reply to this post by Karl Denninger
17.03.2019 21:58, Karl Denninger wrote:

> Huh?  A simple *filesystem copy* managed to force a 16Gb system into
> requiring page file backing store?
>
> I was able to complete the copy by temporarily adding the swap space
> back on (where it would be when the move was complete) but that
> requirement is pure insanity and it appears, from what I was able to
> determine, that it came about from the same root cause that's been
> plaguing VM/ZFS interaction since 2014 when I started work this issue --
> specifically, when RAM gets low rather than evict ARC (or clean up UMA
> that is allocated but unused) the system will attempt to page out
> working set.  In this case since it couldn't page out working set since
> there was nowhere to page it to the process involved got an OOM error
> and was terminated.
>
> *I continue to argue that this decision is ALWAYS wrong.*

I agree. I recently found a workaround of sorts for this problem:
increase vm.v_free_min so that when "FREE" memory runs low, the
page daemon wakes earlier and shrinks UMA (and the ZFS ARC too), moving some
memory from WIRED to FREE quickly enough that it can be reused before bad
things happen.

But avoid increasing vm.v_free_min too much (e.g. over 1/4 of total RAM),
because the kernel may start behaving strangely. For a 16GB system it should
be enough to raise vm.v_free_min to 262144 pages (1GB) or 131072 pages (512MB).
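
For anyone checking the numbers: vm.v_free_min is counted in pages, not bytes. The sketch below converts between the two and applies the 1/4-of-RAM ceiling mentioned above. It assumes 4 KiB pages (consistent with the figures quoted, but worth confirming with `sysctl hw.pagesize`); the helper names are made up for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages; confirm with `sysctl hw.pagesize`
MIB = 1024 ** 2
GIB = 1024 ** 3


def pages_to_bytes(pages: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a page count (the unit of vm.v_free_min) to bytes."""
    return pages * page_size


def bytes_to_pages(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a byte count to whole pages, rounding down."""
    return nbytes // page_size


def exceeds_quarter_of_ram(pages: int, ram_bytes: int,
                           page_size: int = PAGE_SIZE) -> bool:
    """True if the reserve would be larger than 1/4 of total RAM."""
    return pages_to_bytes(pages, page_size) > ram_bytes // 4


ram = 16 * GIB  # the 16GB system discussed in this thread
for pages in (131072, 262144):
    mib = pages_to_bytes(pages) // MIB
    too_big = exceeds_quarter_of_ram(pages, ram)
    print(f"vm.v_free_min={pages} -> {mib} MiB, over 1/4 of RAM: {too_big}")
```

The same helpers can be used to pick a page count for a machine with a different amount of RAM before running `sysctl vm.v_free_min=<pages>`.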

This is not a permanent solution in any way, but it really helps.


Re: Observations from a ZFS reorganization on 12-STABLE

Pete French-3


On 17/03/2019 21:57, Eugene Grosbein wrote:

> I agree. Recently I've found kind-of-workaround for this problem:
> increase vm.v_free_min so when "FREE" memory goes low,
> page daemon wakes earlier and shrinks UMA (and ZFS ARC too) moving some
> memory
> from WIRED to FREE quick enough so it can be re-used before bad things
> happen.
>
> But avoid increasing vm.v_free_min too much (e.g. over 1/4 of total RAM)
> because kernel may start behaving strange. For 16Gb system it should be
> enough
> to raise vm.v_free_min upto 262144 (1GB) or 131072 (512M).
>
> This is not permanent solution in any way but it really helps.

Ah, that's very interesting, thank you for that! I've been bitten by this
issue too in the past, and it is (as mentioned) much improved on 12, but
the fact it could still cause issues worries me.

-pete.

Re: Observations from a ZFS reorganization on 12-STABLE

Walter Cramer
I suggest caution in raising vm.v_free_min, at least on 11.2-RELEASE
systems with less RAM.  I tried "65536" (256MB) on a 4GB mini-server, with
vfs.zfs.arc_max of 2.5GB.  Bad things happened when the cron daemon merely
tried to run `periodic daily`.
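
To put those figures side by side, here is a rough memory budget for that configuration. It is only arithmetic on the numbers given above, assuming 4 KiB pages and treating vfs.zfs.arc_max as fully used; it ignores all other wired kernel memory and is not meant as an explanation of the thrashing. The function name is hypothetical.

```python
MIB = 1024 ** 2
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def rough_budget(ram_mib: int, arc_max_mib: int, free_min_pages: int,
                 page_size: int = PAGE_SIZE) -> dict:
    """Split RAM into the v_free_min reserve, a full ARC, and the remainder."""
    reserve_mib = free_min_pages * page_size // MIB
    return {
        "reserve_mib": reserve_mib,
        "reserve_pct_of_ram": 100.0 * reserve_mib / ram_mib,
        "outside_arc_and_reserve_mib": ram_mib - arc_max_mib - reserve_mib,
    }


# 4GB machine, vfs.zfs.arc_max of 2.5GB, vm.v_free_min of 65536 pages.
budget = rough_budget(ram_mib=4096, arc_max_mib=2560, free_min_pages=65536)
for name, value in budget.items():
    print(f"{name}: {value}")
```

With the ARC mostly full, whatever is left after the ARC and the reserve has to hold the rest of the kernel and every userland process.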

A few more details: the ARC was mostly full, and the "bad things" were (1)
`pagedaemon` seemed to be thrashing memory, using 100% of CPU with
little disk activity, and (2) many normal processes seemed unable to run.
The latter is probably explained by `man 3 sysctl` (see entry for
"VM_V_FREE_MIN").



Re: Observations from a ZFS reorganization on 12-STABLE

Karl Denninger
In reply to this post by Pete French-3

On 3/18/2019 08:07, Pete French wrote:

>
>
> On 17/03/2019 21:57, Eugene Grosbein wrote:
>> I agree. Recently I've found kind-of-workaround for this problem:
>> increase vm.v_free_min so when "FREE" memory goes low,
>> page daemon wakes earlier and shrinks UMA (and ZFS ARC too) moving
>> some memory
>> from WIRED to FREE quick enough so it can be re-used before bad
>> things happen.
>>
>> But avoid increasing vm.v_free_min too much (e.g. over 1/4 of total RAM)
>> because kernel may start behaving strange. For 16Gb system it should
>> be enough
>> to raise vm.v_free_min upto 262144 (1GB) or 131072 (512M).
>>
>> This is not permanent solution in any way but it really helps.
>
> Ah, thats very interesting, thankyou for that! I;ve been bitten by
> this issue too in the past, and it is (as mentioned) much improved on
> 12, but the act it could still cause issues worries me.
>
>
The code patch I developed originally essentially sought to have the ARC
code pare back before the pager started evicting working set.  A second
crack went after clearing allocated-but-not-in-use UMA.

v_free_min may not be the right place to do this -- see if bumping up
vm.v_free_target also works.

I'll stick this on my "to do" list, but it's much less critical in my
applications than it was with 10.x and 11.x, both of which suffered from
it much more severely, to the point that I was getting "stalls" that in
some cases went on for 10 or more seconds due to things like your shell
being evicted to swap to make room for ARC, which is flat-out nuts.
That, at least, doesn't appear to be a problem with 12.

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: Observations from a ZFS reorganization on 12-STABLE

Karl Denninger
In reply to this post by Walter Cramer
On 3/18/2019 08:37, Walter Cramer wrote:

> I suggest caution in raising vm.v_free_min, at least on 11.2-RELEASE
> systems with less RAM.  I tried "65536" (256MB) on a 4GB mini-server,
> with vfs.zfs.arc_max of 2.5GB.  Bad things happened when the cron
> daemon merely tried to run `periodic daily`.
>
> A few more details - ARC was mostly full, and "bad things" was 1:
> `pagedaemon` seemed to be thrashing memory - using 100% of CPU, with
> little disk activity, and 2: many normal processes seemed unable to
> run. The latter is probably explained by `man 3 sysctl` (see entry for
> "VM_V_FREE_MIN").
>
Raising free_target should *not* result in that sort of thrashing.
However, that's not really a fix standing alone either, since the
underlying problem is not being addressed by either change.  It is
especially dangerous to raise the pager wakeup thresholds if you still
run into allocated-but-not-in-use UMA not being cleared out, as
there's a risk of severe pathological behavior arising that's worse than
the original problem.

11.1 and before (I didn't have enough operational experience with 11.2
to know, as I went to 12.x from mostly-11.1 installs around here) were
essentially unusable in my workload without either my patch set or the
Phabricator one.

This is *very* workload-specific however, or nobody would use ZFS on
earlier releases, and many do without significant problems.

--
Karl Denninger
[hidden email] <mailto:[hidden email]>
/The Market Ticker/
/[S/MIME encrypted email preferred]/


Re: Observations from a ZFS reorganization on 12-STABLE

peter.blok
Same here using mfsbsd from 11-RELEASE. First attempt I forgot to add swap - it killed the ssh I was using to issue a zfs send on the remote system.

Next attempt I added swap, but ssh got killed too.

Third attempt I used mfsbsd from 12-RELEASE. It succeeded.

Now I am using mfsbsd 11-RELEASE with added swap and with vfs.zfs.arc_min and vfs.zfs.arc_max set to 128MB (it is a 4GB system), and it succeeds.
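
For reference, a small sketch of what such an ARC clamp could look like as boot-time tunables. How the values were actually set is not stated above, so the use of /boot/loader.conf here is an assumption, as is the helper name; the values are written out in bytes.

```python
MIB = 1024 ** 2


def arc_tunable_lines(arc_min_mib: int, arc_max_mib: int) -> list:
    """Render vfs.zfs.arc_min / vfs.zfs.arc_max lines with values in bytes."""
    if arc_min_mib > arc_max_mib:
        raise ValueError("arc_min must not exceed arc_max")
    return [
        f'vfs.zfs.arc_min="{arc_min_mib * MIB}"',
        f'vfs.zfs.arc_max="{arc_max_mib * MIB}"',
    ]


# Both limits pinned to 128MB, as on the 4GB system described above.
lines = arc_tunable_lines(128, 128)
print("\n".join(lines))
```

The printed lines could be appended to /boot/loader.conf on the rescue image, or the same byte values applied some other way if that is how the system is normally tuned.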


