numa and taskqueues

numa and taskqueues

Emeric POUPON-2
Hello,

I have made a review to boost ipsec performance when very few flows are involved: https://reviews.freebsd.org/D10680 (reviews would be appreciated btw!)
The idea is to dispatch the crypto jobs using a taskqueue (with nb threads = nbcpus), details are in the review.

However, this does not scale well on multi-socket architectures (e.g. 2*6 cores): a lot of time is wasted in the locks.

For testing purposes, I created as many taskqueues as domains and I modified the taskqueue_start_threads function to specify a cpuset_t mask.
The idea here is to stay on the same domain to dispatch the crypto jobs and to notify back the crypto users.
This gives quite good performance, so it seems to be a promising approach.
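
A minimal sketch of the per-domain dispatch idea, written as plain userland C rather than kernel code. The CPU-to-domain table, the job_queue structure and the function names are all hypothetical and only model the 2*6 cores example; real code would take the topology from the kernel and enqueue onto an actual taskqueue.

```c
#include <assert.h>

#define NDOMAINS 2   /* hypothetical: two sockets */
#define NCPUS    12  /* hypothetical: 2*6 cores */

/* Hypothetical CPU -> domain table (CPUs 0-5 on socket 0, 6-11 on socket 1). */
static const int cpu_to_domain[NCPUS] = {
	0, 0, 0, 0, 0, 0,
	1, 1, 1, 1, 1, 1
};

/* Stand-in for one taskqueue per domain; it only counts enqueued jobs. */
struct job_queue {
	int		domain;
	unsigned long	enqueued;
};

static struct job_queue queues[NDOMAINS] = { { 0, 0 }, { 1, 0 } };

/* Pick the queue that belongs to the domain of the submitting CPU. */
static struct job_queue *
queue_for_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS);
	return (&queues[cpu_to_domain[cpu]]);
}

/* Enqueue a job from the given CPU; returns the domain that will run it. */
static int
dispatch_job(int cpu)
{
	struct job_queue *q = queue_for_cpu(cpu);

	q->enqueued++;
	return (q->domain);
}
```

With this table, a job submitted from CPU 7 lands on the domain 1 queue, so the threads serving it and the notification back to the crypto user can stay on the same socket.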

Now the question is: how can I make the taskqueues "domain aware"?
Do I have to add some logic in crypto(9) or could this be abstracted in some other part of the kernel?
Another annoying part is the kprocs used by the return queues. We would also have to bind them to a single domain. How?

What do you think?

Emeric
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-arch
To unsubscribe, send any mail to "[hidden email]"

Re: numa and taskqueues

Adrian Chadd-4
Hi,

I've been worried about the trend to create ncpu*taskqueue or
ndomain*taskqueue for things unless we really need the priority /
preemption behaviour. Otherwise we will just end up with a lot of
pcpu/pdomain taskqueues that sit idly and/or compete inefficiently.

Anyway - I think it'd be nice to have domain aware and pcpu aware
taskqueues so we can eventually migrate to a taskqueue group model of
"one top level thing for net processing" for devices to share, etc,
etc. But for the short term just prototype it with some thin API in
crypto that wraps the taskqueue / kproc work so it gets done, then
push that work out for review/evaluation. If it does indeed work the
way you intend, we can try to use it as a template for a higher level,
shared taskqueue thing.

Thanks,


-adrian


On 19 May 2017 at 00:13, Emeric POUPON <[hidden email]> wrote:

> Hello,
>
> I have made a review to boost ipsec performance when very few flows are involved: https://reviews.freebsd.org/D10680 (reviews would be appreciated btw!)
> The idea is to dispatch the crypto jobs using a taskqueue (with nb threads = nbcpus), details are in the review.
>
> However, this does not scale well on multi-socket architectures (e.g. 2*6 cores): a lot of time is wasted in the locks.
>
> For testing purposes, I created as many taskqueues as domains and I modified the taskqueue_start_threads function to specify a cpuset_t mask.
> The idea here is to stay on the same domain to dispatch the crypto jobs and to notify back the crypto users.
> This gives quite good performance, so it seems to be a promising approach.
>
> Now the question is: how can I make the taskqueues "domain aware"?
> Do I have to add some logic in crypto(9) or could this be abstracted in some other part of the kernel?
> Another annoying part is the kprocs used by the return queues. We would also have to bind them to a single domain. How?
>
> What do you think?
>
> Emeric

Re: numa and taskqueues

Emeric POUPON-2
Hi,

> Anyway - I think it'd be nice to have domain aware and pcpu aware
> taskqueues so we can eventually migrate to a taskqueue group model of
> "one top level thing for net processing" for devices to share, etc,
> etc. But for the short term just prototype it with some thin API in
> crypto that wraps the taskqueue / kproc work so it gets done, then
> push that work out for review/evaluation. If it does indeed work the
> way you intend, we can try to use it as a template for a higher level,
> shared taskqueue thing.

It looks like it is somewhat mandatory to modify the taskqueue API to pin threads to the
correct CPUs. The logic to define which CPUs need to be set is another story, which can indeed
be implemented in crypto(9) first.
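
A minimal sketch of the "which CPUs need to be set" logic, again as userland C. It assumes a flat table giving the domain of each CPU and uses a 64-bit integer as a stand-in for cpuset_t; the function name and the example topology are hypothetical. The resulting mask is the kind of value that would be handed to a taskqueue thread-start routine accepting a CPU mask.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64  /* limit imposed by the 64-bit stand-in for cpuset_t */

/* Hypothetical topology: CPUs 0-5 in domain 0, CPUs 6-11 in domain 1. */
static const int example_cpu_domain[12] = {
	0, 0, 0, 0, 0, 0,
	1, 1, 1, 1, 1, 1
};

/*
 * Build the CPU mask for one domain: bit N is set when CPU N belongs to
 * the requested domain. Returns 0 when the domain owns no CPU.
 */
static uint64_t
domain_cpu_mask(const int *cpu_domain, int ncpus, int domain)
{
	uint64_t mask = 0;

	assert(ncpus >= 0 && ncpus <= MAX_CPUS);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (cpu_domain[cpu] == domain)
			mask |= (uint64_t)1 << cpu;
	}
	return (mask);
}
```

Keeping this computation in one small helper means crypto(9) can own the policy for now, and the helper can move to a shared layer later if the approach proves out.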

By the way:
1/ do you have some pointers on domain enumeration and other NUMA-related code?
2/ about https://reviews.freebsd.org/D10680, I think it would be great to have this committed as a first step.
Since it seems to be stuck, maybe I can add more people to the review. Any suggestions?

Emeric


Re: numa and taskqueues

Adrian Chadd-2
On 30 May 2017 at 03:56, Emeric POUPON <[hidden email]> wrote:

> Hi,
>
>> Anyway - I think it'd be nice to have domain aware and pcpu aware
>> taskqueues so we can eventually migrate to a taskqueue group model of
>> "one top level thing for net processing" for devices to share, etc,
>> etc. But for the short term just prototype it with some thin API in
>> crypto that wraps the taskqueue / kproc work so it gets done, then
>> push that work out for review/evaluation. If it does indeed work the
>> way you intend, we can try to use it as a template for a higher level,
>> shared taskqueue thing.
>
> It looks like it is somewhat mandatory to modify the taskqueue API to pin threads to the
> correct CPUs. The logic to define which CPUs need to be set is another story, which can indeed
> be implemented in crypto(9) first.
>
> By the way:
> 1/ do you have some pointers on domain enumeration and other NUMA-related code?

Sorry, I'm a bit too busy with other things to dive in right now :(

> 2/ about https://reviews.freebsd.org/D10680, I think it would be great to have this committed as a first step.
> Since it seems to be stuck, maybe I can add more people to the review. Any suggestions?

Well, what's with the ~ 8% performance decrease? Do you know what's
going on? For a "we're parallelising IPSEC operations" change, seeing it get
slower with more flows is a bit concerning.

Thanks,




-adrian

Re: numa and taskqueues

Emeric POUPON-2
Hi,

>
>> 2/ about https://reviews.freebsd.org/D10680, I think it would be great to have
>> this committed as a first step.
>> Since it seems to be stuck, maybe I can add more people to the review. Any suggestions?
>
> Well, what's with the ~ 8% performance decrease? Do you know what's
> going on? For a "we're parallelising IPSEC operations" change, seeing it get
> slower with more flows is a bit concerning.
>
> Thanks,
>

Actually, there is a performance boost only when few flows are involved.
That's why this is not activated by default and a sysctl is provided to enable the feature.

To sum up, the more distinct flows you process (both ciphered and unciphered), the more network queues are hit and the more CPUs end up doing ipsec work.
In this case, we do indeed see a loss, most likely due to the extra queueing/reordering performed.
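
To illustrate where a reordering cost can come from, here is a minimal userland C sketch of in-order delivery with sequence numbers. It is an assumption about the general technique, not a description of what D10680 actually does; the structure, the window size and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 8  /* hypothetical maximum number of jobs in flight */

struct reorder {
	uint32_t next_seq;	/* next sequence number to hand out */
	uint32_t next_deliver;	/* oldest job not yet delivered */
	bool	 done[WINDOW];	/* completion flags, indexed by seq % WINDOW */
};

/* Tag a job at submission time and return its sequence number. */
static uint32_t
reorder_submit(struct reorder *r)
{
	assert(r->next_seq - r->next_deliver < WINDOW);
	return (r->next_seq++);
}

/*
 * Record that job 'seq' finished on some worker. Returns how many jobs
 * can now be delivered in submission order; a finished job stays parked
 * until every older job has finished too.
 */
static int
reorder_complete(struct reorder *r, uint32_t seq)
{
	int delivered = 0;

	assert(seq - r->next_deliver < r->next_seq - r->next_deliver);
	r->done[seq % WINDOW] = true;
	while (r->next_deliver != r->next_seq &&
	    r->done[r->next_deliver % WINDOW]) {
		r->done[r->next_deliver % WINDOW] = false;
		r->next_deliver++;
		delivered++;
	}
	return (delivered);
}
```

If jobs 0, 1 and 2 are submitted and job 2 finishes first, nothing is delivered until jobs 0 and 1 complete. With many flows already spread over many CPUs, that waiting and the extra queue hop add work without adding parallelism.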

Re: numa and taskqueues

Adrian Chadd-2
On 30 May 2017 at 07:46, Emeric POUPON <[hidden email]> wrote:

> Hi,
>
>>
>>> 2/ about https://reviews.freebsd.org/D10680, I think it would be great to have
>>> this committed as a first step.
>>> Since it seems to be stuck, maybe I can add more people to the review. Any suggestions?
>>
>> Well, what's with the ~ 8% performance decrease? Do you know what's
>> going on? For a "we're parallelising IPSEC operations" change, seeing it get
>> slower with more flows is a bit concerning.
>>
>> Thanks,
>>
>
> Actually, there is a performance boost only when few flows are involved.
> That's why this is not activated by default and a sysctl is provided to enable the feature.
>
> To sum up, the more distinct flows you process (both ciphered and unciphered), the more network queues are hit and the more CPUs end up doing ipsec work.
> In this case, we do indeed see a loss, most likely due to the extra queueing/reordering performed.

Can you dig into that a bit more? Do you know exactly what's going on?
eg, is it a "lock contention" problem? Is it a "stuff is context
switching, thus latency" problem? etc, etc.



-adrian

Re: numa and taskqueues

Emeric POUPON-2

>>
>> Actually, there is a performance boost only when few flows are involved.
>> That's why this is not activated by default and a sysctl is provided to enable the
>> feature.
>>
>> To sum up, the more distinct flows you process (both ciphered and unciphered),
>> the more network queues are hit and the more CPUs end up doing ipsec
>> work.
>> In this case, we do indeed see a loss, most likely due to the extra
>> queueing/reordering performed.
>
> Can you dig into that a bit more? Do you know exactly what's going on?
> eg, is it a "lock contention" problem? Is it a "stuff is context
> switching, thus latency" problem? etc, etc.
>

Unfortunately I cannot tell you the exact reason right now.
I am sure there is no lock contention involved though (except of course when several domains are involved).
Did you expect such a change to be enabled by default?

Emeric

Re: numa and taskqueues

Adrian Chadd-2
On 31 May 2017 at 06:53, Emeric POUPON <[hidden email]> wrote:

>
>>>
>>> Actually, there is a performance boost only when few flows are involved.
>>> That's why this is not activated by default and a sysctl is provided to enable the
>>> feature.
>>>
>>> To sum up, the more distinct flows you process (both ciphered and unciphered),
>>> the more network queues are hit and the more CPUs end up doing ipsec
>>> work.
>>> In this case, we do indeed see a loss, most likely due to the extra
>>> queueing/reordering performed.
>>
>> Can you dig into that a bit more? Do you know exactly what's going on?
>> eg, is it a "lock contention" problem? Is it a "stuff is context
>> switching, thus latency" problem? etc, etc.
>>
>
> Unfortunately I cannot tell you the exact reason right now.
> I am sure there is no lock contention involved though (except of course when several domains are involved).
> Did you expect such a change to be enabled by default?

Well, I'd really like to get to the bottom of these. :-P



-adrian