Stale memory during post fork cow pmap update


Elliott.Rabe
Greetings-

I've been hunting for the root cause of elusive, slight memory
corruptions in a large, complex process that manages many threads.  All
failures and experimentation thus far have been on x86_64 architecture
machines, and pmap_pcid is not in use.

I believe I have stumbled into a very unlikely race condition in the way
the vm code updates the pmap during write fault processing following a
fork of the process.  In this situation, when the process is forked,
appropriate vm entries are marked copy-on-write. One such entry
allocated by static process initialization is frequently used by many
threads in the process.  This makes it a prime candidate to write-fault
shortly after a fork system call is made.  In this scenario, such a
fault normally burdens the faulting thread with the task of allocating a
new page, entering the page as part of managed memory, and updating the
pmap with the new physical address and the change to writeable status.  
This action is followed by an invalidation of the TLB on the current
CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
on other CPUs (there are often many active threads in this process).  
Before this remote TLB invalidation has completed, other CPUs are free
to act on either the old OR new page characteristics.  If other threads
are alive and using contents of the faulting page on other CPUs, bad
things can occur.

In one simplified and somewhat contrived example, one thread attempts to
write to a location on the faulting page under the protection of a lock
while another thread attempts to read from the same location twice in
succession under the protection of the same lock.  If both the writing
thread and reading thread are running on different CPUs, and if the
write is directed to the new physical address, the reads may come from
different physical addresses if a TLB invalidation occurs between them.  
This seemingly violates the guarantees provided by the locking
primitives and can result in subtle memory corruption symptoms.
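
For concreteness, here is a minimal user-space sketch of that access
pattern (illustrative code only, not the actual test fixture discussed
later in this thread; run by itself it is very unlikely to hit the
window, since the kernel-side timing has to cooperate):

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int shared;	/* lives on the COW-faulting page */

static void *
writer(void *arg)
{
	pthread_mutex_lock(&lock);
	shared = 42;		/* store may land in the NEW physical page */
	pthread_mutex_unlock(&lock);
	return (NULL);
}

static void *
reader(void *arg)
{
	int a, b;

	pthread_mutex_lock(&lock);
	a = shared;		/* may be satisfied from the OLD page... */
	b = shared;		/* ...and this one from the NEW page, if the
				   TLB invalidation lands in between */
	pthread_mutex_unlock(&lock);
	if (a != b)
		fprintf(stderr, "locked reads disagree: %d != %d\n", a, b);
	return (NULL);
}

int
main(void)
{
	pthread_t w, r;

	if (fork() == 0)	/* fork marks the page copy-on-write */
		_exit(0);
	/*
	 * A third thread write-faulting elsewhere on the same page would
	 * have to open the window right here; omitted for brevity.
	 */
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	wait(NULL);
	return (0);
}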

It took me quite a while to chase these symptoms from user-space down
into the operating system, and even longer to end up with a stand-alone
test fixture able to reproduce the situation described above on demand.  
If I alter the kernel code to perform a two-stage update of the pmap
entry, the observed corruption symptoms disappear.  This two-stage
mechanism updates and invalidates the new physical address in a
read-only state first, and then does a second pmap update and
invalidation to change the status to writeable.  The intended effect was
to cause any other threads writing to the faulting page to become
obstructed until the earlier fault is complete, thus eliminating the
possibility of the physical pages having different contents until the
new physical address was fully visible.  This is goofy, and from an
efficiency standpoint it is obviously undesirable, but it was the first
thing that came to mind, and it seems to be working fine.
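
In rough pseudo-kernel terms, the idea is something like the following
(an illustrative sketch only, not the actual change; pmap_invalidate_page()
is the real amd64 interface, but pte and newpte here just stand in for
state that pmap_enter() already has in hand):

/*
 * Stage 1: install the new physical frame read-only and make that
 * visible everywhere first.  A concurrent writer on another CPU now
 * faults and blocks instead of dirtying a frame that other CPUs may
 * still be reading through a stale TLB entry.
 */
pte_store(pte, newpte & ~(PG_RW | PG_M));
pmap_invalidate_page(pmap, va);		/* local flush + IPI_INVLPG IPIs */

/*
 * Stage 2: grant write access and invalidate again.
 */
pte_store(pte, newpte);
pmap_invalidate_page(pmap, va);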

I am not terribly familiar with the higher level design here, so it is
unclear to me if this problem is simply a very unlikely race condition
that hasn't yet been diagnosed or if this is instead the breakdown of
some other mechanism of which I am not aware.  I would appreciate the
insights of those of you who have more history and experience with this
area of the code.

Thank you for your time!

Elliott Rabe
[hidden email]

Re: Stale memory during post fork cow pmap update

Konstantin Belousov-3
On Sat, Feb 10, 2018 at 05:12:11AM +0000, [hidden email] wrote:

> Greetings-
>
> I've been hunting for the root cause of elusive, slight memory
> corruptions in a large, complex process that manages many threads.  All
> failures and experimentation thus far have been on x86_64 architecture
> machines, and pmap_pcid is not in use.
> [...]
> I am not terribly familiar with the higher level design here, so it is
> unclear to me if this problem is simply a very unlikely race condition
> that hasn't yet been diagnosed or if this is instead the breakdown of
> some other mechanism of which I am not aware.  I would appreciate the
> insights of those of you who have more history and experience with this
> area of the code.

You are right in describing the operation of the memory copy on fork,
but I cannot understand from your description which parts of it,
exactly, are problematic.  It is necessary for you to provide the test,
and some kind of test trace or output which illustrates the issue you
found.

Do you mean something like this:
- after fork
- thread A writes into the page, causing page fault and COW because the
  entry has write permissions removed
- thread B reads from the page, and since the invalidation IPI was not yet
  delivered, it reads from the need-copy page, effectively seeing the
  old content from before thread A's write

And your claim is that you can create a situation where threads A and B
both hold the same lock around the write and read?  I do not understand
this, since if thread A holds a (usermode) lock, it prevents thread B
from taking the same lock until the fault is fully handled, in
particular, until the IPI is delivered.

Re: Stale memory during post fork cow pmap update

Elliott.Rabe
On 02/10/2018 05:18 AM, Konstantin Belousov wrote:

> On Sat, Feb 10, 2018 at 05:12:11AM +0000, [hidden email] wrote:
>> ...
>
> Do you mean something like this:
> - after fork
> - thread A writes into the page, causing page fault and COW because the
>   entry has write permissions removed
> - thread B reads from the page, and since the invalidation IPI was not yet
>   delivered, it reads from the need-copy page, effectively seeing the
>   old content from before thread A's write
>
> And your claim is that you can create a situation where threads A and B
> both hold the same lock around the write and read?  I do not understand
> this, since if thread A holds a (usermode) lock, it prevents thread B
> from taking the same lock until the fault is fully handled, in
> particular, until the IPI is delivered.

Here is the sequence of actions I am referring to.  There is only one
lock, and all the writes/reads are on one logical page.

+The process is forked transitioning a map entry to COW
+Thread A writes to a page on the map entry, faults, updates the pmap to
writable at a new phys addr, and starts TLB invalidations...
+Thread B acquires a lock, writes to a location on the new phys addr,
and releases the lock
+Thread C acquires the lock, reads from the location on the old phys addr...
+Thread A ...continues the TLB invalidations which are completed
+Thread C ...reads from the location on the new phys addr, and releases
the lock

In this example Threads B and C lock, use, and unlock properly, and
neither owns the lock at the same time.  Thread A was writing somewhere
else on the page and so never had/needed the lock.  Thread B sees a
location, one that is only ever read or modified under the lock, change
beneath it while it is the lock owner.

I will get a test patch together and make it available as soon as I can.

Re: Stale memory during post fork cow pmap update

Konstantin Belousov-3
On Sat, Feb 10, 2018 at 09:56:20PM +0000, [hidden email] wrote:

> On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
> > ...
>
> Here is the sequence of actions I am referring to.  There is only one
> lock, and all the writes/reads are on one logical page.
>
> +The process is forked transitioning a map entry to COW
> +Thread A writes to a page on the map entry, faults, updates the pmap to
> writable at a new phys addr, and starts TLB invalidations...
> +Thread B acquires a lock, writes to a location on the new phys addr,
> and releases the lock
> +Thread C acquires the lock, reads from the location on the old phys addr...
> +Thread A ...continues the TLB invalidations which are completed
> +Thread C ...reads from the location on the new phys addr, and releases
> the lock
>
> In this example Threads B and C lock, use, and unlock properly, and
> neither owns the lock at the same time.  Thread A was writing somewhere
> else on the page and so never had/needed the lock.  Thread B sees a
> location, one that is only ever read or modified under the lock, change
> beneath it while it is the lock owner.
I believe you mean 'Thread C' in the last sentence.

>
> I will get a test patch together and make it available as soon as I can.
Please.

So I agree that doing a two-stage COW, with the first stage copying the
page but keeping it read-only, is the right solution.  Below is my take.
During the smoke boot, I noted that there is a somewhat related issue in
the reevaluation of the map entry permissions.

diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index 83e12a588ee..149a15f1d9d 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -1135,6 +1157,10 @@ RetryFault:;
 			 */
 			pmap_copy_page(fs.m, fs.first_m);
 			fs.first_m->valid = VM_PAGE_BITS_ALL;
+			if ((fault_flags & VM_FAULT_WIRE) == 0) {
+				prot &= ~VM_PROT_WRITE;
+				fault_type &= ~VM_PROT_WRITE;
+			}
 			if (wired && (fault_flags &
 			    VM_FAULT_WIRE) == 0) {
 				vm_page_lock(fs.first_m);
@@ -1219,6 +1245,12 @@ RetryFault:;
 			 * write-enabled after all.
 			 */
 			prot &= retry_prot;
+			fault_type &= retry_prot;
+			if (prot == 0) {
+				release_page(&fs);
+				unlock_and_deallocate(&fs);
+				goto RetryFault;
+			}
 		}
 	}
 

Re: Stale memory during post fork cow pmap update

Don Lewis-5
On 11 Feb, Konstantin Belousov wrote:

> On Sat, Feb 10, 2018 at 09:56:20PM +0000, [hidden email] wrote:
>> ...
>
> So I agree that doing a two-stage COW, with the first stage copying the
> page but keeping it read-only, is the right solution.  Below is my take.
> During the smoke boot, I noted that there is a somewhat related issue in
> the reevaluation of the map entry permissions.
>
> [vm_fault.c patch elided; see above]

I'm seeing really good results with this patch on my Ryzen package build
machine.  I have a set of 1782 ports that I build on a regular basis, and
I just did three consecutive poudriere runs.  Other than five ports that
hard-fail due to incompatibility with clang 6, and one gmake-related
build runaway of lang/rust on one run that I think is a separate issue, I
saw none of the flaky build issues that I've had with this machine.
I've seen no recurrence of the lang/go build problems that I've had
since day 1 on this machine (but not on my AMD FX-8320E machine), and
also no sign of the guile-related build failures that I've seen both on
this machine and the FX-8320E.  I also didn't see the net/samba build
hangs that seem to be caused by a heavily threaded python process getting
deadlocked.


Re: Stale memory during post fork cow pmap update

Elliott.Rabe

On 02/10/2018 04:56 PM, Konstantin Belousov wrote:

> On Sat, Feb 10, 2018 at 09:56:20PM +0000, [hidden email] wrote:
>> ...
> I believe you mean 'Thread C' in the last sentence.
You are correct, I did mean Thread C.
>> I will get a test patch together and make it available as soon as I can.
> Please.

Sorry for my delayed response; I had been working off a separate project
based on releng/11.1, and it took me longer than I expected to get a dev
rig set up off of master on which I could re-evaluate the situation.

I am attaching my test apparatus; however, calling it a test is probably
a disservice to tests everywhere.  I consider this entire fixture
disposable, so I didn't get carried away trying to properly
style/partition/locate the code.  I never wanted anything this
complicated either; it pretty much just evolved into a development aid
to spelunk around in the fault/pmap handling.  My attempts thus far at
reducing the fixture to be user-space only have not been successful.
Additionally, I have noticed that the fixture is /very/ sensitive to any
changes in timing; several of the debugging entries even seem key to
hitting the problem.  I didn't have much luck getting the problem to
manifest on a virtual machine guest with a VirtualBox host either.  For
all of these reasons, I don't think there is value in trying to use
this as any sort of regression fixture, unless perhaps someone is
willing to try to turn it into something less ridiculous.  Despite all
its shortcomings, on my hardware anyway, it is able to reproduce the
example I described pretty much immediately when I use it with the
debugging knob "-v".  Instructions and expectations are at the top of the
main test fixture source file.

I am also attaching a patch that I have been using to prevent the
problem.  I was looking at things with a much narrower view and made the
changes directly in pmap_enter.  I suspect the internal
double-update-invalidate is slightly better performance-wise than taking
two whole faults, but I haven't benchmarked it; it probably doesn't
matter much compared to the cost and frequency of the actual copies, and
it also has the disadvantage of being architecture specific.  I also
don't feel like I have enough experience with the vm fault code in
general for my commentary to be very valuable here.  However, I do
wonder: 1) whether there are any other scenarios where a potentially
accessible page might be undergoing an [address+writable] change in the
same way (this sort of thing seems hard to read out of code), and 2)
whether there is ever any legal reason why an accessible page should be
undergoing such a change?  If not, perhaps we could come up with an
appropriate sanity-check condition to guard against any cases of this
sort of thing accidentally slipping in later.
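
As a very rough illustration of the kind of guard meant here, something
like the following might sit in the amd64 pmap_enter (a sketch only;
origpte/newpte follow pmap_enter's local naming, and the condition has
not been checked against legitimate cases such as superpage
promotion/demotion):

/*
 * Hypothetical sanity check (sketch): replacing a valid mapping with
 * one pointing at a different physical frame must not grant write
 * access in a single step, or other CPUs could keep writing through
 * the old frame while the TLB shootdown is still in flight.
 */
KASSERT((origpte & PG_V) == 0 ||
    (origpte & PG_FRAME) == (newpte & PG_FRAME) ||
    (newpte & PG_RW) == 0,
    ("pmap_enter: one-step frame change to a writable mapping"));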

The attached git patches should apply and build cleanly on master commit
fe0ee5c.  I have verified at least these three scenarios in my environment:
1) the fixture alone reproduces the problem.
2) the fixture with my patch does not reproduce the problem.
3) the fixture with your patch does not reproduce the problem.

Thanks!


0001-DISPOSABLE-A-test-fixture-that-can-repro-a-pmap-upda.patch (94K) Download Attachment
0002-TRIAL-Double-invalidate-when-finishing-COW-pmap-upda.patch (4K) Download Attachment

Re: Stale memory during post fork cow pmap update

mdtancsa
On 2/10/2018 5:56 PM, Konstantin Belousov wrote:

> On Sat, Feb 10, 2018 at 09:56:20PM +0000, [hidden email] wrote:
>> ...


The patch below seems to fix the issues I was seeing in

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584

at least I have not been able to reproduce it.  It would normally take
2-3 builds of net/samba47 to manifest, but I was able to do 70 overnight
without fail.  For some reason, this issue was far more acute on AMD
Ryzen CPUs than on any of the Intel CPUs I had been testing on.


> So I agree that doing a two-stage COW, with the first stage copying the
> page but keeping it read-only, is the right solution.  Below is my take.
> During the smoke boot, I noted that there is a somewhat related issue in
> the reevaluation of the map entry permissions.
>
> [vm_fault.c patch elided; see above]


--
-------------------
Mike Tancsa, tel +1 519 651 3400 x203
Sentex Communications, [hidden email]
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada

Re: Stale memory during post fork cow pmap update

Konstantin Belousov-3
On Tue, Feb 13, 2018 at 09:10:21AM +0000, [hidden email] wrote:

>
> Sorry for my delayed response; I had been working off a separate project
> based on releng/11.1, and it took me longer than I expected to get a dev
> rig set up off of master on which I could re-evaluate the situation.
>
> I am attaching my test apparatus; however, calling it a test is probably
> a disservice to tests everywhere.
> [...]
> Despite all its shortcomings, on my hardware anyway, it is able to
> reproduce the example I described pretty much immediately when I use it
> with the debugging knob "-v".  Instructions and expectations are at the
> top of the main test fixture source file.
Apparently Ryzen CPUs are able to demonstrate it quite reliably with the
python driver for the samba build.  It was very surprising to me,
especially because I tried to understand the Ryzen bug for the whole last
week and thought that it was more likely a CPU store/load inconsistency
than a software thing.  See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584.

>
> I am also attaching a patch that I have been using to prevent the
> problem.
> [...]
> However, I do wonder: 1) whether there are any other scenarios where a
> potentially accessible page might be undergoing an [address+writable]
> change in the same way (this sort of thing seems hard to read out of
> code), and 2) whether there is ever any legal reason why an accessible
> page should be undergoing such a change?  If not, perhaps we could come
> up with an appropriate sanity-check condition to guard against any cases
> of this sort of thing accidentally slipping in later.
I am not sure how to formulate such a check.  I believe there is no
other place in the kernel, except the COW handling, which can create a
similar situation, but I cannot guarantee it.

I also believe that the problem exists for all SMP hardware, so fixing
the pmaps would create a lot of work, which is also quite hard to
validate.  Note that the system generally considers faults which do not
cause disk io as 'easy'; they are accounted as 'soft faults'.  Sure, not
causing the second fault for the write would be more efficient than
arranging for it, if only because we would not need to look up and lock
the map and the shadow chain of vm objects and pages, but fork is quite
costly already and code simplicity there is also important.

>
> The attached git patches should apply and build cleanly on master commit
> fe0ee5c.  I have verified at least these three scenarios in my environment:
> 1) the fixture alone reproduces the problem.
> 2) the fixture with my patch does not reproduce the problem.
> 3) the fixture with your patch does not reproduce the problem.

I put the review with the patch at https://reviews.freebsd.org/D14347.
You have to add yourself as a subscriber or reviewer.

Re: Stale memory during post fork cow pmap update

Elliott.Rabe

On 02/13/2018 07:30 AM, Konstantin Belousov wrote:
> Apparently Ryzen CPUs are able to demonstrate it quite reliably with the
> python driver for the samba build.  It was very surprising to me,
> especially because I tried to understand the Ryzen bug for the whole last
> week and thought that it was more likely a CPU store/load inconsistency
> than a software thing.  See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584.
Cool, it is good to have other validations.  As for the goose-chase,
it's totally understandable; the subtle nature of the corruptions (i.e.,
stale reads) often causes the symptoms to manifest as an
MP-caching store/load problem.

>> I am also attaching a patch that I have been using to prevent the
>> problem.
>> [...]
>> If not, perhaps we could come up with an appropriate sanity-check
>> condition to guard against any cases of this sort of thing accidentally
>> slipping in later.
> I am not sure how to formulate such a check.  I believe there is no
> other place in the kernel, except the COW handling, which can create a
> similar situation, but I cannot guarantee it.
>
> I also believe that the problem exists for all SMP hardware, so fixing
> the pmaps would create a lot of work, which is also quite hard to
> validate.  Note that the system generally considers faults which do not
> cause disk io as 'easy'; they are accounted as 'soft faults'.  Sure, not
> causing the second fault for the write would be more efficient than
> arranging for it, if only because we would not need to look up and lock
> the map and the shadow chain of vm objects and pages, but fork is quite
> costly already and code simplicity there is also important.
Fair enough, thanks for this perspective.  I now plan to apply just your
patch back in the original environment and confirm stability follows.
> I put the review with the patch at https://reviews.freebsd.org/D14347.
> You have to add yourself as a subscriber or reviewer.
Done, thanks again!

Re: Stale memory during post fork cow pmap update

mdtancsa
On 2/13/2018 8:30 AM, Konstantin Belousov wrote:
> Apparently Ryzen CPUs are able to demonstrate it quite reliably with the
> python driver for the samba build.  It was very surprising to me,
> especially because I tried to understand the Ryzen bug for the whole last
> week and thought that it was more likely a CPU store/load inconsistency
> than a software thing.  See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584.
Just tested on an AMD EPYC 7281 (16/32 cores), and it runs into this bug
every time on the samba build.  Applying the referenced patch fixes the
problem; at least I was able to test 4 builds in a row without issue.

        ---Mike


--
-------------------
Mike Tancsa, tel +1 519 651 3400 x203
Sentex Communications, [hidden email]
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada

Re: Stale memory during post fork cow pmap update

Steven Hartland
@kib do you think this issue could be the cause of the golang crashes
during fork that we spoke about a while back?
https://github.com/golang/go/issues/15658

If it's possible, is there anything specific I could do to force a
regular occurrence in order to confirm or deny it?

On 13/02/2018 21:49, Mike Tancsa wrote:

> On 2/13/2018 8:30 AM, Konstantin Belousov wrote:
>> ...
> Just tested on an AMD EPYC 7281 (16/32 cores), and it runs into this bug
> every time on the samba build.  Applying the referenced patch fixes the
> problem; at least I was able to test 4 builds in a row without issue.
>
> ---Mike


Re: Stale memory during post fork cow pmap update

Konstantin Belousov
On Wed, Feb 14, 2018 at 09:06:18AM +0000, Steven Hartland wrote:
> @kib do you think this issue could be the cause of the golang crashes
> during fork that we spoke about a while back?
> https://github.com/golang/go/issues/15658
>
> If it's possible, is there anything specific I could do to force a
> regular occurrence in order to confirm or deny it?
I have no idea.  It is much easier to check than to try plotting theories
about applicability.

I will not be surprised by either outcome.