Re: bhyve guest's memory representation & live migration using COW

Elena Mihailescu
On Thu, 7 Feb 2019 at 14:04, Elena Mihailescu
<[hidden email]> wrote:

>
> On Fri, 14 Dec 2018 at 17:02, Elena Mihailescu
> <[hidden email]> wrote:
> >
> > On Thu, 8 Nov 2018 at 20:11, John Baldwin <[hidden email]> wrote:
> > >
> > > [ Adding Patrick Mooney who hacks on bhyve @ Joyent to the cc ]
> > > On 11/7/18 2:09 PM, John Baldwin wrote:
> > > > On 10/29/18 8:39 AM, Elena Mihailescu wrote:
> > > >> On Sat, 27 Oct 2018 at 00:59, Mark Johnston <[hidden email]> wrote:
> > > >>>
> > > >>> On Thu, Oct 25, 2018 at 02:06:24PM +0300, Elena Mihailescu wrote:
> > > >>>> Hi all,
> > > >>>>
> > > >>>>
> > > >>>> On Thu, 25 Oct 2018 at 02:49, Matthew Grooms <[hidden email]> wrote:
> > > >>>>>
> > > >>>>> On 10/24/2018 1:13 PM, Mark Johnston wrote:
> > > >>>>>> On Tue, Oct 23, 2018 at 08:44:22AM -0700, John Baldwin wrote:
> > > >>>>>>> Mark Johnston and I talked a bit about this at the MeetBSD conference last week.
> > > >>>>>>> I had talked to Mark about Alan's note about msync() as I wasn't quite sure what
> > > >>>>>>> he meant by that and what he suggested is something that I think is a simpler
> > > >>>>>>> version of your proposal.  Rather than creating an entirely separate set of page
> > > >>>>>>> tables, we could perhaps add a second dirty bit for a vm_page (I think in this
> > > >>>>>>> case it would be fine to have a single bit rather than a mask) that would be the
> > > >>>>>>> "VMM dirty" bit.  Anytime the VM would mark a page as dirty due to checking PTE
> > > >>>>>>> bits and propagating them to the vm_page_t it would also set the VMM dirty bit
> > > >>>>>>> (I believe you could perhaps just do this in vm_page_dirty?).  However, the VMM
> > > >>>>>>> dirty bit would only be cleared during a migration before starting the copy of a
> > > >>>>>>> page, and the VM system would not clear this bit when it clears its own notion of
> > > >>>>>>> dirty bits.  Note that this also means that migration scans might need to check
> > > >>>>>>> the PTEs and propagate dirty bits "up" via vm_page_dirty() before checking the
> > > >>>>>>> VMM dirty bit.  This is probably a lot less work than reviving submaps and I
> > > >>>>>>> think it might be a small diff if it can be centralized in just vm_page_dirty()
> > > >>>>>>> (and if a single bit is ok vs the mask we use for the existing dirty bits).
> > > >>>>>>>
> > > >>>>>>> Mark, does that description sound accurate?
> > > >>>>>> That's basically what I had in mind as well.  In my head the scan looks
> > > >>>>>> roughly like this:
> > > >>>>>>
> > > >>>>>> foreach page m in guest physical address space:
> > > >>>>>>      vm_page_test_dirty(m);
> > > >>>>>>      if ((m->oflags & PG_VMM_DIRTY) == 0)
> > > >>>>>>       continue;
> > > >>>>>>      vm_page_xbusy(m);
> > > >>>>>>      pmap_remove_write(m);
> > > >>>>>>      m->oflags &= ~PG_VMM_DIRTY;
> > > >>>>>>      <transfer page, or make a copy>
> > > >>>>>>      vm_page_xunbusy(m);
> > > >>>>>>
> > > >>>>>> Note that the test for the dirty bit is racy; there needs to be a final
> > > >>>>>> scan which unconditionally clears PG_RW and tests for PG_M on each page
> > > >>>>>> while the VM is paused.
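The bookkeeping of the scan above can be checked with a small user-space
model. This is only a sketch in plain C, not kernel code: struct
model_page, MODEL_VMM_DIRTY and the model_* helpers are hypothetical
names, and the model assumes that removing write access also resets the
modified bit of the mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16   /* tiny pages keep the model readable */
#define MODEL_VMM_DIRTY 0x80 /* hypothetical flag value */

struct model_page {
	bool pte_modified;   /* stands in for PG_M in a mapping */
	bool writable;       /* stands in for PG_RW */
	unsigned flags;      /* stands in for m->oflags */
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Guest store: the "hardware" sets the modified bit on the mapping. */
static void
model_guest_write(struct model_page *m, size_t off, unsigned char v)
{
	assert(off < MODEL_PAGE_SIZE);
	m->writable = true;  /* a write fault would re-grant write access */
	m->data[off] = v;
	m->pte_modified = true;
}

/* Propagate the mapping's modified bit "up" into the page flags. */
static void
model_test_dirty(struct model_page *m)
{
	if (m->pte_modified)
		m->flags |= MODEL_VMM_DIRTY;
}

/* One scan: copy every VMM-dirty page into dst; returns pages copied. */
static size_t
model_scan(struct model_page *pages, size_t n, unsigned char *dst)
{
	size_t copied = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_page *m = &pages[i];

		model_test_dirty(m);
		if ((m->flags & MODEL_VMM_DIRTY) == 0)
			continue;
		/* The real code would exclusive-busy the page here. */
		m->writable = false;      /* remove write access */
		m->pte_modified = false;  /* assumed side effect, see above */
		m->flags &= ~MODEL_VMM_DIRTY;
		memcpy(dst + i * MODEL_PAGE_SIZE, m->data, MODEL_PAGE_SIZE);
		copied++;
	}
	return (copied);
}
```

In this model a page written once is copied by the next scan and then
skipped until it is written again, which is the behaviour the extra bit
is meant to provide.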
> > > >>>>>
> > > > >>>>> At the risk of sounding painfully ignorant: Does the VM's page list need
> > > >>>>> to be locked in order to walk it safely while the VM is running? If so,
> > > >>>>> would the 'transfer page' option be too slow to perform inside this
> > > >>>>> loop? Does the 'make a copy' alternative entail copying all the dirty
> > > >>>>> pages so they can be transferred after dropping the lock?
> > > >>>>>
> > > >>>>
> > > >>>> I also have some questions about this approach:
> > > >>>>
> > > >>>> 1. As I understood from John, your suggestion is to add a new field in
> > > >>>> the vm_page structure for the guest (vmm_dirty). I was looking into
> > > >>>> the vm_page structure and I don't think I can use a bit from the
> > > >>>> flags, aflags or oflags fields since I will end up mixing the fields'
> > > >>>> meanings.
> > > >>>> But, if I add a new field, it means I will use additional space which
> > > >>>> brings me to the question: this idea sounds a lot like the first idea
> > > >>>> that came to mind when I started implementing a migration feature for
> > > >>>> bhyve (my idea was to use a counter that would be incremented each
> > > >>>> time a page was written). But, I received negative feedback from a lot
> > > >>>> of people that it would imply to modify the vm_page structure for just
> > > >>>> one use-case (bhyve migration) and that the changes will affect the
> > > >>>> entire operating system and that my patch might not be accepted into
> > > >>>> the mainstream because of this. What do you think?
> > > >>>>
> > > > >>>> Also, I am worried about the memory overhead this additional bit
> > > > >>>> (field actually) will bring to FreeBSD.
> > > > >>>>
> > > > >>>> Also, if I use one of the unused bits in the vm_page flags
> > > > >>>> (flags, oflags, aflags), it could happen that, before my patch is
> > > > >>>> accepted, that particular bit gets used for another purpose and a
> > > > >>>> conflict occurs, maybe without anyone realizing.
> > > >>>
> > > >>> In this hypothetical scheme, the VM subsystem would be aware of the
> > > >>> extra bit used to store the vmm_dirty state, so I don't think we need to
> > > >>> worry about collisions.  Both the "flags" and "oflags" fields have a
> > > >>> number of spare bits.  We definitely want to avoid adding any new
> > > >>> fields to struct vm_page, indeed.
> > > >>>
> > > > >>>> 2. I am not very familiar with the physical page tables
> > > >>>> that a host would normally use. For the EPTs I know that those bits
> > > >>>> are "sticky" (once set, only the software can unset them - the
> > > >>>> hardware will only set them and never unset). Is it the same thing for
> > > >>>> the normal page tables and only the software can unset them? Or are
> > > >>>> there any other hardware mechanisms (like MMU) able to unset them?
> > > >>>
> > > >>> The access and modified bits (PG_A and PG_M) are only ever cleared by
> > > >>> software, regardless of the page table format.
> > > >>>
> > > >>>> 3. Also, are the set/unset operation strictly related to the vm_page
> > > >>>> structure? Is there any (software) mechanism that can set and unset
> > > >>>> them and the virtual memory system to not update the dirty flag?
> > > >>>
> > > >>> I don't quite follow your question, but I'll try to explain the
> > > >>> interaction between the pmap and the representation of the accessed
> > > >>> and modified bits in struct vm_page.  For the modified bit of mappings
> > > >>> of a page m, m->dirty just caches the value of the PG_M.  If I want to
> > > >>> ask the question, "is m dirty?" and m->dirty != 0, then I'm done.
> > > > >>> If m->dirty == 0, I must call pmap_is_modified(m) to see if any
> > > >>> mapping of m has PG_M set, and if so, the value is propagated to
> > > >>> m->dirty.  The process of cleaning a page has multiple steps:
> > > >>>
> > > >>> 1. Busy the page
> > > >>> 2. Mark its mappings as read-only and clear PG_M from any mappings
> > > >>> 3. Write the page's contents to disk
> > > >>> 4. Set m->dirty = 0 (assuming the write was successful)
> > > >>> 5. Unbusy the page
> > > >>>
> > > >>> Step 2 will trigger a page fault on any attempt to write to the page.
> > > >>> The page fault handler will sleep while the page is busy, so any thread
> > > >>> attempting to simultaneously modify the page must wait until step 5, at
> > > >>> which point it will proceed and upgrade the mapping's permission from
> > > >>> read-only to read-write.  Once the write completes, we know that the
> > > >>> contents of m are identical to its copy on disk, so the page can be
> > > >>> marked clean.
> > > >>>
> > > >>> The accessed bit is handled a little bit differently; it is used mainly
> > > >>> to decide whether to evict a page.  The page daemon uses
> > > >>> pmap_ts_referenced() to check for and clear PG_A bits in mappings of a
> > > >>> page m.  If references are found, m->act_count is increased, ensuring
> > > >>> that the page won't be evicted right away.  PG_A is handled a bit
> > > >>> differently from PG_M in that PG_A is cleared once the reference has
> > > >>> been reflected in the vm_page struct, while PG_M is left set until the
> > > >>> page is cleaned.
> > > >>>
> > > >>>> 4. Also, I do not understand very well the way the virtual memory
> > > >>>> system behaves. I will write what I understood from our previous
> > > >>>> conversations and I want you to correct me if I am wrong.
> > > >>>>
> > > >>>> When a page is written, the hardware (MMU) will set the dirty/modified
> > > >>>> bit. From time to time, a virtual management routine will walk through
> > > >>>> all the pages from the page tables (all the page table entries to be
> > > >>>> more precise) and if the dirty flag is set, then the vm_page that
> > > >>>> points to that physical page will be updated (vm_page ->dirty will be
> > > >>>> set to VM_PAGE_BITS_ALL). Also, from time to time, other virtual
> > > >>>> memory system routines will check the dirty flag from vm_pages and
> > > >>>> will commit the changes and will unset the dirty bit (e.g. the page
> > > >>>> laundering process).
> > > >>>
> > > >>> Right, this is what I tried to describe above.
> > > >>>
> > > >>>> So, the whole idea is to introduce another field into the vm_page that
> > > >>>> will be set when the vm_page will be set as dirty, but not cleared
> > > >>>> when the vm_page dirty field will be cleared. Because the page
> > > >>>> walking/physical dirty bit check is happening from time to time, I
> > > >>>> have to do another manual page walk before starting the migration
> > > >>>> round to see if other pages have become dirty in the meanwhile (from
> > > >>>> the last virtual memory subsystem scan).
> > > >>>>
> > > >>>> Is that correct?
> > > >>>
> > > >>> Yes, you can't rely on the VM system to provide an up-to-date value of
> > > >>> the page's dirty state.  m->dirty is a lazy cache.
> > > >>>
> > > >>>> 5. Another issue that concerns me is the guest memory dual view. When
> > > >>>> the guest will write something in its memory, through the EPT
> > > >>>> mechanism, the corresponding page entry from the guest pmap (that has
> > > >>>> the EPT type) will be updated and the dirty flag will be set.
> > > >>>>
> > > >>>> But what happens when the host will write something into the guest
> > > >>>> memory (e.g. virtio related operations or when the guest wants to read
> > > >>>> something from disk)? A corresponding entry from the host page tables
> > > >>>> will point to the same physical page that has also an entry into the
> > > >>>> EPTs?
> > > >>>>
> > > >>>> I am kinda lost here and I am not sure if the proposed algorithm will
> > > >>>> cover the case when the host will write something into the guest's
> > > >>>> memory.
> > > >>>
> > > >>> I think it can.  Note that vm_page_test_dirty() calls
> > > >>> pmap_is_modified(), which searches _all_ mappings of m.  In particular,
> > > >>> if both the guest and host have mapped the same page, it should catch
> > > >>> modifications from either pmap.
> > > >>>
> > > >>>> 6. I was looking over the Mark's pseudocode idea, and I do not
> > > >>>> understand why the write access has to be removed from a page.
> > > >>>
> > > >>> The idea is to ensure that the page transfer process has a consistent
> > > >>> view of the page's contents.  It may be unnecessary in your
> > > >>> case since there is a final step where the guest is paused and we
> > > >>> perform a final scan for modifications.
> > > >>
> > > >> Thank you for your response. It clears a lot of the questions I had. I
> > > >> have an idea about how I could implement this. I'll start the
> > > >> implementation and I'll come back with an update as soon as I have
> > > >> something.
> > > >
> > > > I did have one later thought about this scheme which is that swapping has
> > > > no way to save/restore the VMM dirty bit.  I had originally thought about
> > > > just setting a flag on the VM object to enable setting the VMM dirty bit
> > > > and setting that on the VM object when it was created.  You would then
> > > > never swap a page that had the VMM dirty bit set to avoid losing that bit.
> > > > However, this would effectively prevent swapping of guest memory pages.
> > > > Instead, we could wait and only set the VM object flag at the start of a
> > > > migration and then make an initial sweep over the VM object marking all
> > > > the pages as "VMM dirty".  During the first scan of memory this would
> > > > result in copying all of guest memory, but that's ok.  After the first
> > > > pass, the "VMM dirty" bits would then work "correctly".
> > >
> > > Another option for swapping that might be simpler would be to just always
> > > set the VMM dirty bit when swapping a page back in unconditionally.  This
> > > would avoid the need for trying to block swapping of a page and is probably
> > > less work overall.  If you are under memory pressure this might degrade into
> > > a case where your live migration can't make forward progress because it
> > > swaps out clean pages and then has to swap them back in for the next sweep
> > > (because we will have to assume that any swapped out pages are dirty and
> > > always treat them as dirty during a scan), but that may also end up being a
> > > rare case in practice.
> > >
> > > --
> > > John Baldwin
> > >
> >
> > Hi all,
> >
> > As I've discussed with John in the previous bhyve call, I'm writing an
> > email on this thread to inform you all about my progress towards
> > implementing a live migration feature in bhyve.
> >
> > I've pretty much implemented the framework, but I have some issues
> > regarding the live migration procedure behaviour.
> >
> > I should mention that the current implementation has some
> > constraints:
> > - the guest must have wired memory (as John anticipated in a previous
> > email, there are issues with pages that are not present: for some
> > indexes, vm_page_lookup [1] returns NULL. Following Matthew's
> > suggestion, I've restricted the live migration feature to wired
> > memory);
> > - the guest memory size must be less than the lowmem_limit (I
> > think it is something around 3GB right now). This is just for
> > convenience: if I can correctly live migrate the lowmem segment of a
> > guest, then I can quite easily extend the implementation to the
> > highmem segment.
> >
> > ======== ALGORITHM ===========
> > This being said, the algorithm should do the following things:
> > send_memory [2]:
> >     // First round: send all pages.
> >     // Use an array to keep the indexes of the pages to be migrated.
> >     page_list_indexes = all_guest_pages();
> >     lock_all_vCPUs();
> >     /* The pages to be migrated are already recorded, so the dirty
> >        bits can now be cleared for all pages (see later
> >        observations). */
> >     clear_migration_dirty_bits(); [3]
> >     unlock_all_vCPUs();
> >     send_pages(); /* see below */
> >
> >     for each non-final round do
> >         page_list_indexes = []
> >         lock_all_vCPUs()
> >         search_dirty_pages(page_list_indexes); [4]
> >         clear_migration_dirty_bits();
> >         unlock_all_vCPUs();
> >         send_pages(); /* see below */
> >     end for
> >
> >     // Final round
> >     page_list_indexes = []
> >     lock_all_vCPUs()
> >     search_dirty_pages(page_list_indexes);
> >     send_pages();
> >     ... // send CPU state, kernel structs' state, devices' state
> >     freeze CPUs; unlock_all_vCPUs() and finish the source guest
> >
> > send_pages [5]:
> >     while (has_pages_to_be_sent)
> >         retrieve_N_pages_from_kernel [6]
> >         send(pages, /* through */ socket)
> >         update(has_pages_to_be_sent)
> >     end while
> >
> > For the receiving part, the algorithm is simpler: I start a guest, and
> > right before the virtual cpus are spinning up, I wait for the
> > migration info to be received and update the pages accordingly [7].
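The round structure of the algorithm above can be exercised in user
space before involving the kernel. The following is a rough sketch under
stated assumptions: struct guest, guest_write(), mark_all_dirty() and
send_round() are hypothetical names, a bool array stands in for the
per-page migration dirty bit, and "sending" is a memcpy into a
destination buffer rather than a socket write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PG 8 /* model page size in bytes */

struct guest {
	unsigned char *mem; /* npages * PG bytes of guest memory */
	bool *dirty;        /* one migration-dirty flag per page */
	size_t npages;
};

/* A guest store marks the containing page as dirty. */
static void
guest_write(struct guest *g, size_t addr, unsigned char v)
{
	assert(addr < g->npages * PG);
	g->mem[addr] = v;
	g->dirty[addr / PG] = true;
}

/* First round: every page is treated as dirty. */
static void
mark_all_dirty(struct guest *g)
{
	for (size_t i = 0; i < g->npages; i++)
		g->dirty[i] = true;
}

/*
 * One round: record the dirty page indexes in list[], clear the flags,
 * then copy the recorded pages to dst. Returns the number of pages sent.
 */
static size_t
send_round(struct guest *g, unsigned char *dst, size_t *list)
{
	size_t n = 0;

	/* The vCPUs would be locked while the dirty set is collected. */
	for (size_t i = 0; i < g->npages; i++) {
		if (g->dirty[i]) {
			list[n++] = i;
			g->dirty[i] = false;
		}
	}
	/* vCPUs unlocked: a page written now is dirty again next round. */
	for (size_t k = 0; k < n; k++)
		memcpy(dst + list[k] * PG, g->mem + list[k] * PG, PG);
	return (n);
}
```

Calling mark_all_dirty() followed by send_round() models the first
round; later calls to send_round() model the incremental rounds. When a
round returns 0 and the guest is paused, source and destination images
should compare equal.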
> >
> > ======== TESTS AND RESULTS ===========
> >
> > As for the tests, I've tested live migration using the following two scenarios:
> > Scenario 1: power on the guest; when login prompt appears live migrate the vm;
> > Scenario 2: power on the guest; login as a user (root); execute while
> > true; do echo $((i=i+1)); sleep 1; done;
> >
> > As for the results:
> > - When testing Scenario 1, the guest seemed to be migrated correctly
> > (although sometimes, after the login, the guest would fail with a
> > kernel panic in the guest)
> > - When testing Scenario 2 with a 256MB guest (or a 512MB guest),
> > the migration completed and the guest continued executing the
> > script from where it was stopped by the migration procedure
> > - When testing Scenario 2 with a 1GB guest (or a 2GB guest), the
> > migration fails: sometimes the sleep gets stuck, sometimes I get a
> > kernel panic in the guest ("supervisor page not " or from
> > vmspace_exit() and pmap_remove_pages() and so on).
> >
> > It seems that there are some bugs in the implementation.
> >
> > ========== QUESTIONS ==========
> >
> > Sorry for the long intro, but I think that I should give details about
> > the implementation so you could have a bigger picture about my
> > questions.
> > First of all, I would like to know whether, at least conceptually, the
> > algorithm is fine (even if it can be improved).
> >
> > Second of all, I have some doubts regarding the way I should clean the
> > migration dirty bits. I need a method that would clean the migration
> > dirty bit and as well, the physical modified bit (so I could track the
> > page level modifications that happen between two rounds). In the
> > current implementation, I use vm_object_page_clean that should clean
> > all the object pages (I've seen that it should sync the memory).
> >
> > Another approach I was trying to implement was to clean the migration
> > bit and physical modified flag for each page after I copy it from the
> > kernel space into the user space in order to migrate it.  I was trying
> > to use the pmap_remove_write() as Mark suggested, but it seems that I
> > get a host kernel panic this way (something related to the wired
> > memory: "panic: pmap_demote_pde: page table page for a wired mapping
> > is missing". The backtrace shows that the panic happens when the
> > get_page function calls pmap_remove_write and pmap_remove_write calls
> > pmap_demote_pde_locked.). I've also tried pmap_clear_modify() but it
> > didn't help a lot.
> >
> > So, what do you think? What should I use? I know that in the current
> > implementation a page may have its dirty bit cleared, be modified in
> > the meantime, be transferred, and then be transferred again in the
> > next round, but this should only add redundant transfers, not
> > errors.
> >
> >
> > =========== LINKS ==============
> >
> > [1] https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_page.c#L1541
> > [2] Live Migration - Send Memory:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/usr.sbin/bhyve/migration.c#L2294
> > [3] Clear Guest Dirty Bits:
> >      [3.1] Call vm_page_clear_vmm_dirty_bit:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_page.c#L1354
> > for each of the guest pages
> >      [3.2] Call vm_object_page_clean(object, 0, 0, OBJPC_SYNC); for
> > the object that contains the lowmem segment. The vm_object_page_clean
> > function is already implemented in FreeBSD:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_object.c#L883
> >              I know that this is not the intended way of using this
> > function, but I needed something that should clear the physical
> > modified bit without affecting the way the virtual memory subsystem
> > behaves.
> >      [3.3] Here is the graph of vm_object_page_clean function all of
> > its usages: http://www.leidinger.net/FreeBSD/dox/vm/html/d4/dfe/vm__object_8h.html#acb8268e5afa032b6213738cceba196a2
> > [4] Search dirty pages:
> >      [4.1] iterate through all of the guest pages; check if page is
> > dirty and if so, update field in array accordingly:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_radix.c#L827
> >      [4.2] to test if page is dirty, vm_page_test_vmm_dirty is called:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_page.c#L1360
> >      [4.3] vm_page_test_vmm_dirty calls vm_page_test_dirty that could
> > call vm_page_dirty and updates VPO_VMM_DIRTY and dirty flag:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_page.h#L735
> > [5] https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/usr.sbin/bhyve/migration.c#L2119
> > [6] Copy object pages from kernel space to userspace
> >      [6.1] Call vm_object_get_page for each page that I am interested
> > in: https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/amd64/vmm/vmm.c#L3362
> >      [6.2] vm_object_get_page implementation:
> > https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_object.c#L2463
> > [7] https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration_dev/sys/vm/vm_object.c#L2483
> >
> > Thank you very much for your help,
> > Elena
>
>
> Hi all,
>
> I have some issues related to the dirty bit mechanism to detect the
> memory changes between two migration rounds and I wanted to know if
> anyone can give me some ideas about the way I should debug this. The
> idea is that the migrated memory and the guest memory differ at the
> end of the live migration.
>
> The dirty bit implementation is the following:
> - use a bit in vm_page->oflags (mask: 0x80)
> - update the bit in vm_page_dirty() and vm_page_dirty_KBI()
> - when searching pages for migration:
>     for each page:
>         vm_page_test_dirty();
>         test vmm dirty bit
>         if the bit is set, the page is added to a to_be_transmitted list
> - when sending pages:
>     for each page in to_be_transmitted list:
>         xbusy_page;
>         copy_page_to_local_buffer();
>         xunbusy_page;
>
> As you may see, I do not clear the virtual machine dirty bit after I
> copy a page. I think that this is not relevant right now. All the
> pages that were ever modified since the guest has been started should
> be transferred, but the final memory should not be affected. However,
> in this case, the guest memory and the migrated memory differ which
> led me to the idea that the memory changes detection algorithm does
> not work properly. Moreover, when the number of rounds is 1 (warm
> migration), then the migration is correct and the guest memory and the
> migrated memory are the same, so I will consider (for this moment)
> that the getter and setter functions for guest memory are OK.
>

Hi all,

I'm writing this e-mail to keep you in the loop with the live
migration progress and to ask for advice regarding the process of
cleaning the dirty bit. At Rod Grimes's suggestion, I added the
virtualization list at CC.

I had some issues with the vmm_dirty bit (a bit from oflags). Initially
I thought they were related to the fact that in some cases oflags is
assigned directly (oflags = <smth>) and to the fact that the dirty bit
is sometimes set by means other than vm_page_dirty(), but it seems that
the issues came from other sources. Now, after cleaning up the code and
refactoring it a bit, it seems to work fine. Some of the code snippets
that I thought could affect the migration are at [1].

However, if I dump the guest memory after the live migration (the
guest is not yet started after migration) and compare it with the
source guest's ctx->baseaddr, there are some differences and I don't
know where they come from.
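To narrow down such differences, it may help to compare the two dumps
page by page and look at which guest page indexes differ, rather than
only checking whether the images are identical. A minimal sketch,
assuming both images are already loaded into memory and have the same
length; diff_pages() is a hypothetical helper, not part of bhyve:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare two equally sized memory images page by page. The indexes of
 * differing pages are stored in out[] (at most max_out of them); the
 * return value is the total number of differing pages.
 */
static size_t
diff_pages(const unsigned char *a, const unsigned char *b, size_t len,
    size_t page_size, size_t *out, size_t max_out)
{
	size_t ndiff = 0;

	assert(page_size > 0 && len % page_size == 0);
	for (size_t off = 0; off < len; off += page_size) {
		if (memcmp(a + off, b + off, page_size) != 0) {
			if (ndiff < max_out)
				out[ndiff] = off / page_size;
			ndiff++;
		}
	}
	return (ndiff);
}
```

The resulting indexes can then be checked against the list of pages
that each round transmitted, to see whether a differing page was never
marked dirty or was marked dirty but copied too early.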

One of the things I must implement is related to the dirty bit
mechanism. For now, I do not clear the dirty bit after I migrate a
page. This is not a wrong implementation because in every round, I
migrate all the pages that have been modified ever (since the guest
was started), but it is not an optimal implementation. To clear the
dirty bit, I've tried different mechanisms, such as:
- pmap_remove_write - kernel panic: pde entry not found for a wired mapped page
- vm_object_page_clean
- vm_map_sync -> kernel panic
- pmap_protect - remove write permission, copy page and add write
permission -> kernel panic; pmap_demote_pde: page table page for a
wired mapping is missing

Any idea about a way of forcing a page to be cleaned (so that the
physical dirty bit is cleared)?

[1] https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_page.c#L2627
https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_fault.c#L1761
https://github.com/FreeBSD-UPB/freebsd/blob/master/sys/vm/vm_page.c#L1912
https://github.com/FreeBSD-UPB/freebsd/blob/master/sys/vm/vm_page.c#L2105
https://github.com/FreeBSD-UPB/freebsd/blob/master/sys/vm/vm_page.c#L2117
https://github.com/FreeBSD-UPB/freebsd/blob/master/sys/vm/vm_page.c#L1204

Thank you,
Elena
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-virtualization
To unsubscribe, send any mail to "[hidden email]"

Re: bhyve guest's memory representation & live migration using COW

Mark Johnston-2
On Mon, Mar 04, 2019 at 10:52:44AM +0200, Elena Mihailescu wrote:

> I'm writing this e-mail to keep you in the loop with the live
> migration progress and to ask for advice regarding the process of
> cleaning the dirty bit. At Rod Grimes's suggestion, I added the
> virtualization list at CC.
>
> I had some issues with the vmm_dirty bit (a bit from oflags). Initially
> I thought they were related to the fact that in some cases oflags is
> assigned directly (oflags = <smth>) and to the fact that the dirty bit
> is sometimes set by means other than vm_page_dirty(), but it seems that
> the issues came from other sources. Now, after cleaning up the code and
> refactoring it a bit, it seems to work fine. Some of the code snippets
> that I thought could affect the migration are at [1].
>
> However, if I dump the guest memory after the live migration (the
> guest is not yet started after migration) and compare it with the
> source guest's ctx->baseaddr, there are some differences and I don't
> know where they come from.
>
> One of the things I must implement is related to the dirty bit
> mechanism. For now, I do not clear the dirty bit after I migrate a
> page.

Note that there is code that only calls vm_page_dirty(m) if
m->dirty != VM_PAGE_BITS_ALL.  One example is in vm_page_advise().
So if the page is dirty from the kernel's point of view and you clear
the VMM dirty bit, subsequent modifications may not be propagated to the
vm_page state.
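This pitfall can be reproduced in a small user-space model. The sketch
below is not kernel code: struct mpage and the mpage_* helpers are
hypothetical names, BITS_ALL stands in for VM_PAGE_BITS_ALL, and
mpage_dirty_if_needed() imitates a caller that skips the dirty call when
the page already looks fully dirty.

```c
#include <assert.h>
#include <stdbool.h>

#define BITS_ALL 0xff /* stands in for VM_PAGE_BITS_ALL */

struct mpage {
	unsigned char dirty; /* the kernel's view, like m->dirty */
	bool vmm_dirty;      /* the migration code's view */
};

/* Analogue of vm_page_dirty() with the extra VMM bit wired in. */
static void
mpage_dirty(struct mpage *m)
{
	m->dirty = BITS_ALL;
	m->vmm_dirty = true;
}

/* A caller that skips the call when the page already looks dirty. */
static void
mpage_dirty_if_needed(struct mpage *m)
{
	if (m->dirty != BITS_ALL)
		mpage_dirty(m);
}

/* Migration consumes the VMM bit but leaves the kernel's dirty state. */
static bool
mpage_take_vmm_dirty(struct mpage *m)
{
	bool was = m->vmm_dirty;

	m->vmm_dirty = false;
	return (was);
}
```

In this model, a first modification followed by mpage_take_vmm_dirty()
reports the page as dirty, but a second modification through
mpage_dirty_if_needed() leaves vmm_dirty clear because m->dirty is still
BITS_ALL, so the next round would miss the page.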

> This is not a wrong implementation because in every round, I
> migrate all the pages that have been modified ever (since the guest
> was started), but it is not an optimal implementation. To clear the
> dirty bit, I've tried different mechanisms, such as:
> - pmap_remove_write - kernel panic: pde entry not found for a wired mapped page
> - vm_object_page_clean
> - vm_map_sync -> kernel panic
> - pmap_protect - remove write permission, copy page and add write
> permission -> kernel panic; pmap_demote_pde: page table page for a
> wired mapping is missing
>
> Any idea about a way of forcing a page to be cleaned (so that the
> physical dirty bit is cleared)?

I'm not sure exactly what you mean, but pmap_clear_modify(m) will clear
the modification bit for mappings of m.  It seems like you want
something roughly like:

        vm_page_xbusy(m);
        if (pmap_is_modified(m))
                m->dirty = m->vmm_dirty = VM_PAGE_BITS_ALL;
        pmap_clear_modify(m);
        vm_page_xunbusy(m);

Or do we need to restrict the operation to a specific mapping of m?
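For illustration, the sequence above can be modelled in user space with
a page that has more than one mapping. This is a sketch under
assumptions, not kernel code: struct tpage and the tpage_* helpers are
hypothetical, NMAP = 2 stands for one host mapping plus one guest (EPT)
mapping, and the busy/unbusy steps are only marked by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NMAP 2        /* e.g. one host mapping and one EPT mapping */
#define BITS_ALL 0xff /* stands in for VM_PAGE_BITS_ALL */

struct tpage {
	bool pte_m[NMAP];        /* modified bit of each mapping */
	unsigned char dirty;     /* like m->dirty */
	unsigned char vmm_dirty; /* like the proposed m->vmm_dirty */
};

/* True if any mapping of the page has its modified bit set. */
static bool
tpage_is_modified(const struct tpage *m)
{
	for (size_t i = 0; i < NMAP; i++)
		if (m->pte_m[i])
			return (true);
	return (false);
}

/* Reset the modified bit in every mapping of the page. */
static void
tpage_clear_modify(struct tpage *m)
{
	for (size_t i = 0; i < NMAP; i++)
		m->pte_m[i] = false;
}

/* Fold the per-mapping bits into the page state, then reset them. */
static void
tpage_harvest(struct tpage *m)
{
	/* The page would be exclusive-busied around these steps. */
	if (tpage_is_modified(m))
		m->dirty = m->vmm_dirty = BITS_ALL;
	tpage_clear_modify(m);
}
```

Because the modified bits are reset after every harvest, a write through
either mapping between two harvests is enough to set vmm_dirty again,
even if m->dirty was already fully set.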