RFC: New KPI for fast temporary single-page KVA mappings

Jason Harmening
Hi everyone,

I'd like to propose a couple of new pmap functions:
vm_offset_t pmap_quick_enter_page(vm_page_t m)
void pmap_quick_remove_page(vm_offset_t kva)

These functions will create and destroy a temporary, usually CPU-local
mapping of the specified page.  Where available, they will use the direct
map.  Otherwise, they will use a per-CPU pageframe that's allocated at boot.

Guarantees:
--Will not sleep
--Will not fail
--Safe to call under a non-spin lock or from an ithread

Restrictions:
--Not safe on all arches to call from an interrupt filter or under a spin mutex
--Mappings should be held for as little time as possible; don't do any
locking or sleeping while holding a mapping
--Current implementation only guarantees a single page of mapping space
across all arches.  MI code should not make nested calls to
pmap_quick_enter_page().

My idea is that the first consumer of this would be busdma.  All non-iommu
implementations would use this for bounce buffer copies of pages that don't
have resident mappings.  Currently busdma uses physcopy[in|out] for
unmapped buffers, which on most arches uses sf_bufs that can sleep, making
bus_dmamap_sync() unsafe to call in a lot of cases.  busdma would also use
this for virtually-indexed cache maintenance on arm and mips.  It currently
ignores cache maintenance for buffers that don't have a KVA or resident UVA
mapping, which may not be correct for buffers that don't belong to curproc
or have cache-resident VAs on other cores.

I've created 2 Differential reviews:
https://reviews.freebsd.org/D3013: the implementation
https://reviews.freebsd.org/D3014: the kmod I've been using to test it

I'd like any and all feedback, both on the general approach and the
implementation details.  Some things to note on the implementation:
--I've intentionally avoided touching existing pmap code for the time
being.  Some of the new code could likely be shared with other pmap KPIs.
--I've structured the KPI to make it easy to extend to guarantee more than
one per-CPU page in the future.  I could see that being useful for copying
between pages, for example.
--There's no immediate consumer for the sparc64 implementation, since
busdma there needs neither bounce buffers nor cache maintenance.
--I would very much like feedback and testing from experts on non-x86
arches.  I only have hardware to test the i386 and amd64 implementations;
I've only cross-compiled it for everything else.  Some of the non-x86
details, like the Book E powerpc TLB invalidation code, are a bit scary and
probably not quite right.

Thanks,
Jason
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-arch
To unsubscribe, send any mail to "[hidden email]"
Re: RFC: New KPI for fast temporary single-page KVA mappings

John Baldwin
On Tuesday, July 07, 2015 11:37:55 AM Jason Harmening wrote:

> [...]

I do think something like this would be useful.  What I had wanted to do was
to add a 'memory cursor' to go along with memory descriptors.  The idea is
that you can use a cursor to iterate over any descriptor, and that one of
the options when creating a virtual-address cursor would be to ask it to
preallocate any resources it needs at creation time (e.g. a page of KVA on
platforms without a direct map).  Then if a driver or GEOM module needs to
walk over arbitrary I/O buffers that come down via virtual addresses, it
could allocate one or more cursors.

I have a partial implementation of cursors in a p4 branch, but it is of
course missing the hard part: VA mappings without a direct map.  However,
this approach would let you have N of these things and also control the
lifecycle of the temporary KVA addresses instead of having a fixed set.

--
John Baldwin

Re: RFC: New KPI for fast temporary single-page KVA mappings

Konstantin Belousov
On Tue, Jul 14, 2015 at 11:30:23AM -0700, John Baldwin wrote:

> On Tuesday, July 07, 2015 11:37:55 AM Jason Harmening wrote:
> > [...]
>
> I do think something like this would be useful.  What I had wanted to do was
> to add a 'memory cursor' to go along with memory descriptors.  The idea would
> be that you can use a cursor to iterate over any descriptor, and that one of
> the options when creating a virtual address cursor was to ask it to preallocate
> any resources it needs at creation time (e.g. a page of KVA on platforms without
> a direct map).   Then if a driver or GEOM module needs to walk over arbitrary
> I/O buffers that come down via virtual addresses, it could allocate one or more
> cursors.
>
> I have a partial implementation of cursors in a p4 branch, but it of course is
> missing the hard part of VA mappings without a direct map.  However, this would
> let you have N of these things and to also control the lifecycle of the temporary
> KVA addresses instead of having a fixed set.
>

I do not quite agree that the proposed KPI and your description of cursors
have much in common.

From what I read above, implementing the temporary VA mappings for
cursors should be easy: allocate the VA at cursor initialization time,
then do pmap_qenter() when needed.  In fact, the non-trivial part would
be the direct map case, where the unneeded VA allocation and qenter
should be optimized out.

The proposed KPI has rather different goals: it needs no pre-setup, yet
it can still be used, and is guaranteed not to fail, from hard contexts
such as swi or interrupt threads (in particular, it works in the busdma
callback context).

My opinion is that the KPI and cursors serve different goals.  Perhaps
the KPI could be used as a building block for some of the cursor
functionality.

Re: RFC: New KPI for fast temporary single-page KVA mappings

Jason Harmening
Yeah, I see both this and cursors as being useful for different purposes.
Code that just needs to do a simple, quick operation on a page and doesn't
want to worry about setup/teardown and synchronization (or needs to work
under low-KVA conditions) could use pmap_quick_enter_page().  More complex
code, especially code that needs a lot of pages in-flight at the same time,
could use cursors.

As kib mentioned, kva_alloc() + pmap_qenter() seems like a ready-made MI
cursor implementation.  If you want to optimize for direct maps, you could
make MD cursor implementations that bypass those steps; that's very roughly
what physcopy* does.  But I'm not sure that would be worth the trouble.
For the most part, the arches that have comprehensive direct maps are
also 64-bit arches, where KVA pageframes are the most plentiful.

Would it make sense to reimplement sf_bufs as a pool of cursors?

On Wed, Jul 15, 2015 at 9:15 AM, Konstantin Belousov <[hidden email]>
wrote:

> On Tue, Jul 14, 2015 at 11:30:23AM -0700, John Baldwin wrote:
> > [...]
>
> I do not quite agree that the proposed KPI and your description of cursors
> have much in common.
>
> From what I read above, the implementation of the temporal VA mappings
> for cursors should be easy. You need to allocate VA at the time of
> cursor initialization, and then do pmap_qenter() when needed. In fact,
> it would be not trivial for the direct map case, to optimize out
> unneeded VA allocation and qenter.
>
> The proposed KPI has rather different goals, it does not need any
> pre-setup for use, but still it can be used and guarantees to not fail
> from the hard contexts, like swi or interrupt threads (in particular, it
> works in the busdma callback context).
>
> My opinion is that the KPI and cursors are for different goals. Might
> be, the KPI could be used as a building foundation for some cursor'
> functionality.
>