Memory allocation performance


Memory allocation performance

Alexander Motin-3
Hi.

While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:

         0.14  0.05  132119/545292      ip_forward <cycle 1> [12]
         0.14  0.05  133127/545292      fxp_add_rfabuf [18]
         0.27  0.10  266236/545292      ng_package_data [17]
[9]14.1 0.56  0.21  545292         uma_zalloc_arg [9]
         0.17  0.00  545292/1733401     critical_exit <cycle 2> [98]
         0.01  0.00  275941/679675      generic_bzero [68]
         0.01  0.00  133127/133127      mb_ctor_pack [103]

         0.15  0.06  133100/545266      mb_free_ext [22]
         0.15  0.06  133121/545266      m_freem [15]
         0.29  0.11  266236/545266      ng_free_item [16]
[8]15.2 0.60  0.23  545266         uma_zfree_arg [8]
         0.17  0.00  545266/1733401     critical_exit <cycle 2> [98]
         0.00  0.04  133100/133100      mb_dtor_pack [57]
         0.00  0.00  134121/134121      mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that
are left are practically unavoidable. But even after this, kgmon tells me
that 30% of CPU time is consumed by memory management.

So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, then why is UMA so slow? In some places I have tried to
replace it with a preallocated TAILQ of the required memory blocks
protected by a mutex, and according to the profiler I got _much_ better
results (see the sketch below). Would it be good practice to replace
relatively small UMA zones with a preallocated queue to avoid some of the
UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but is it
really so expensive that it costs 30% of CPU time on a UP router?
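
A minimal sketch of what I mean by a preallocated queue in question 2; the
structure name, element size and malloc type below are only placeholders,
not the real netgraph ones:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>

struct blk {
        TAILQ_ENTRY(blk) link;
        char payload[128];              /* whatever the zone element would hold */
};

static TAILQ_HEAD(, blk) blk_freeq = TAILQ_HEAD_INITIALIZER(blk_freeq);
static struct mtx blk_freeq_mtx;
static MALLOC_DEFINE(M_BLKCACHE, "blkcache", "preallocated block cache");

/* Preallocate everything once, from the subsystem's own init path. */
static void
blk_cache_init(int nblocks)
{
        struct blk *b;
        int i;

        mtx_init(&blk_freeq_mtx, "blk free queue", NULL, MTX_DEF);
        for (i = 0; i < nblocks; i++) {
                b = malloc(sizeof(*b), M_BLKCACHE, M_WAITOK | M_ZERO);
                TAILQ_INSERT_HEAD(&blk_freeq, b, link);
        }
}

/* Allocation: unlink the first free block; NULL means the pool is empty. */
static struct blk *
blk_get(void)
{
        struct blk *b;

        mtx_lock(&blk_freeq_mtx);
        b = TAILQ_FIRST(&blk_freeq);
        if (b != NULL)
                TAILQ_REMOVE(&blk_freeq, b, link);
        mtx_unlock(&blk_freeq_mtx);
        return (b);
}

/* Free: put the block back on the queue; nothing is returned to malloc/UMA. */
static void
blk_put(struct blk *b)
{
        mtx_lock(&blk_freeq_mtx);
        TAILQ_INSERT_HEAD(&blk_freeq, b, link);
        mtx_unlock(&blk_freeq_mtx);
}

In the fast path this is just a TAILQ remove/insert under one default mutex,
with blk_get() returning NULL when the pool is exhausted so the caller can
fall back to the normal allocator.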

Thanks!

--
Alexander Motin

Re: Memory allocation performance

Alexander Motin-3
Alexander Motin writes:
> While profiling netgraph operation on UP HEAD router I have found that
> huge amount of time it spent on memory allocation/deallocation:

I forgot to mention that it was mostly a GENERIC kernel, just built
without INVARIANTS, WITNESS and SMP, but with 'profile 2'.

--
Alexander Motin

Re: Memory allocation performance

Kris Kennaway-3
In reply to this post by Alexander Motin-3
Alexander Motin wrote:

> Hi.
>
> While profiling netgraph operation on UP HEAD router I have found that
> huge amount of time it spent on memory allocation/deallocation:
>
>         0.14  0.05  132119/545292      ip_forward <cycle 1> [12]
>         0.14  0.05  133127/545292      fxp_add_rfabuf [18]
>         0.27  0.10  266236/545292      ng_package_data [17]
> [9]14.1 0.56  0.21  545292         uma_zalloc_arg [9]
>         0.17  0.00  545292/1733401     critical_exit <cycle 2> [98]
>         0.01  0.00  275941/679675      generic_bzero [68]
>         0.01  0.00  133127/133127      mb_ctor_pack [103]
>
>         0.15  0.06  133100/545266      mb_free_ext [22]
>         0.15  0.06  133121/545266      m_freem [15]
>         0.29  0.11  266236/545266      ng_free_item [16]
> [8]15.2 0.60  0.23  545266         uma_zfree_arg [8]
>         0.17  0.00  545266/1733401     critical_exit <cycle 2> [98]
>         0.00  0.04  133100/133100      mb_dtor_pack [57]
>         0.00  0.00  134121/134121      mb_dtor_mbuf [111]
>
> I have already optimized all possible allocation calls and those that
> left are practically unavoidable. But even after this kgmon tells that
> 30% of CPU time consumed by memory management.
>
> So I have some questions:
> 1) Is it real situation or just profiler mistake?
> 2) If it is real then why UMA is so slow? I have tried to replace it in
> some places with preallocated TAILQ of required memory blocks protected
> by mutex and according to profiler I have got _much_ better results.
> Will it be a good practice to replace relatively small UMA zones with
> preallocated queue to avoid part of UMA calls?
> 3) I have seen that UMA does some kind of CPU cache affinity, but does
> it cost so much that it costs 30% CPU time on UP router?

Make sure you have INVARIANTS disabled; it has a high performance cost
in UMA.

Kris

Re: Memory allocation performance

Julian Elischer
In reply to this post by Alexander Motin-3
Alexander Motin wrote:

> Hi.
>
> While profiling netgraph operation on UP HEAD router I have found that
> huge amount of time it spent on memory allocation/deallocation:
>
>         0.14  0.05  132119/545292      ip_forward <cycle 1> [12]
>         0.14  0.05  133127/545292      fxp_add_rfabuf [18]
>         0.27  0.10  266236/545292      ng_package_data [17]
> [9]14.1 0.56  0.21  545292         uma_zalloc_arg [9]
>         0.17  0.00  545292/1733401     critical_exit <cycle 2> [98]
>         0.01  0.00  275941/679675      generic_bzero [68]
>         0.01  0.00  133127/133127      mb_ctor_pack [103]
>
>         0.15  0.06  133100/545266      mb_free_ext [22]
>         0.15  0.06  133121/545266      m_freem [15]
>         0.29  0.11  266236/545266      ng_free_item [16]
> [8]15.2 0.60  0.23  545266         uma_zfree_arg [8]
>         0.17  0.00  545266/1733401     critical_exit <cycle 2> [98]
>         0.00  0.04  133100/133100      mb_dtor_pack [57]
>         0.00  0.00  134121/134121      mb_dtor_mbuf [111]
>
> I have already optimized all possible allocation calls and those that
> left are practically unavoidable. But even after this kgmon tells that
> 30% of CPU time consumed by memory management.
>
> So I have some questions:
> 1) Is it real situation or just profiler mistake?
> 2) If it is real then why UMA is so slow? I have tried to replace it in
> some places with preallocated TAILQ of required memory blocks protected
> by mutex and according to profiler I have got _much_ better results.
> Will it be a good practice to replace relatively small UMA zones with
> preallocated queue to avoid part of UMA calls?
> 3) I have seen that UMA does some kind of CPU cache affinity, but does
> it cost so much that it costs 30% CPU time on UP router?

Given this information, I would add an 'item cache' in ng_base.c
(hmm, do I already have one?)


>
> Thanks!
>


Re: Memory allocation performance

Kris Kennaway-3
In reply to this post by Alexander Motin-3
Alexander Motin wrote:
> Alexander Motin writes:
>> While profiling netgraph operation on UP HEAD router I have found that
>> huge amount of time it spent on memory allocation/deallocation:
>
> I have forgotten to tell that it was mostly GENERIC kernel just built
> without INVARIANTS, WITNESS and SMP but with 'profile 2'.
>

What is 'profile 2'?

Kris

Re: Memory allocation performance

Alexander Motin-3
Kris Kennaway writes:
> Alexander Motin wrote:
>> Alexander Motin writes:
>>> While profiling netgraph operation on UP HEAD router I have found
>>> that huge amount of time it spent on memory allocation/deallocation:
>>
>> I have forgotten to tell that it was mostly GENERIC kernel just built
>> without INVARIANTS, WITNESS and SMP but with 'profile 2'.
>
> What is 'profile 2'?

I thought it was the high resolution profiling support. Isn't it?

--
Alexander Motin

Re: Memory allocation performance

Alexander Motin-3
In reply to this post by Julian Elischer
Julian Elischer writes:

> Alexander Motin wrote:
>> Hi.
>>
>> While profiling netgraph operation on UP HEAD router I have found that
>> huge amount of time it spent on memory allocation/deallocation:
>>
>>         0.14  0.05  132119/545292      ip_forward <cycle 1> [12]
>>         0.14  0.05  133127/545292      fxp_add_rfabuf [18]
>>         0.27  0.10  266236/545292      ng_package_data [17]
>> [9]14.1 0.56  0.21  545292         uma_zalloc_arg [9]
>>         0.17  0.00  545292/1733401     critical_exit <cycle 2> [98]
>>         0.01  0.00  275941/679675      generic_bzero [68]
>>         0.01  0.00  133127/133127      mb_ctor_pack [103]
>>
>>         0.15  0.06  133100/545266      mb_free_ext [22]
>>         0.15  0.06  133121/545266      m_freem [15]
>>         0.29  0.11  266236/545266      ng_free_item [16]
>> [8]15.2 0.60  0.23  545266         uma_zfree_arg [8]
>>         0.17  0.00  545266/1733401     critical_exit <cycle 2> [98]
>>         0.00  0.04  133100/133100      mb_dtor_pack [57]
>>         0.00  0.00  134121/134121      mb_dtor_mbuf [111]
>>
>> I have already optimized all possible allocation calls and those that
>> left are practically unavoidable. But even after this kgmon tells that
>> 30% of CPU time consumed by memory management.
>>
>> So I have some questions:
>> 1) Is it real situation or just profiler mistake?
>> 2) If it is real then why UMA is so slow? I have tried to replace it
>> in some places with preallocated TAILQ of required memory blocks
>> protected by mutex and according to profiler I have got _much_ better
>> results. Will it be a good practice to replace relatively small UMA
>> zones with preallocated queue to avoid part of UMA calls?
>> 3) I have seen that UMA does some kind of CPU cache affinity, but does
>> it cost so much that it costs 30% CPU time on UP router?
>
> given this information, I would add an 'item cache' in ng_base.c
> (hmm do I already have one?)

That was actually my second question. As there are only 512 items by
default and they are small in size, I can easily preallocate them all at
boot. But is that a good approach? Why can't UMA do just the same when I
have created a zone with a specified element size and maximum number of
objects? What is the principal difference?

--
Alexander Motin

Re: Memory allocation performance

Julian Elischer
Alexander Motin wrote:

> Julian Elischer writes:
>> Alexander Motin wrote:
>>> Hi.
>>>
>>> While profiling netgraph operation on UP HEAD router I have found
>>> that huge amount of time it spent on memory allocation/deallocation:
>>>
>>>         0.14  0.05  132119/545292      ip_forward <cycle 1> [12]
>>>         0.14  0.05  133127/545292      fxp_add_rfabuf [18]
>>>         0.27  0.10  266236/545292      ng_package_data [17]
>>> [9]14.1 0.56  0.21  545292         uma_zalloc_arg [9]
>>>         0.17  0.00  545292/1733401     critical_exit <cycle 2> [98]
>>>         0.01  0.00  275941/679675      generic_bzero [68]
>>>         0.01  0.00  133127/133127      mb_ctor_pack [103]
>>>
>>>         0.15  0.06  133100/545266      mb_free_ext [22]
>>>         0.15  0.06  133121/545266      m_freem [15]
>>>         0.29  0.11  266236/545266      ng_free_item [16]
>>> [8]15.2 0.60  0.23  545266         uma_zfree_arg [8]
>>>         0.17  0.00  545266/1733401     critical_exit <cycle 2> [98]
>>>         0.00  0.04  133100/133100      mb_dtor_pack [57]
>>>         0.00  0.00  134121/134121      mb_dtor_mbuf [111]
>>>
>>> I have already optimized all possible allocation calls and those that
>>> left are practically unavoidable. But even after this kgmon tells
>>> that 30% of CPU time consumed by memory management.
>>>
>>> So I have some questions:
>>> 1) Is it real situation or just profiler mistake?
>>> 2) If it is real then why UMA is so slow? I have tried to replace it
>>> in some places with preallocated TAILQ of required memory blocks
>>> protected by mutex and according to profiler I have got _much_ better
>>> results. Will it be a good practice to replace relatively small UMA
>>> zones with preallocated queue to avoid part of UMA calls?
>>> 3) I have seen that UMA does some kind of CPU cache affinity, but
>>> does it cost so much that it costs 30% CPU time on UP router?
>>
>> given this information, I would add an 'item cache' in ng_base.c
>> (hmm do I already have one?)
>
> That was actually my second question. As there is only 512 items by
> default and they are small in size I can easily preallocate them all on
> boot. But is it a good way? Why UMA can't do just the same when I have
> created zone with specified element size and maximum number of objects?
> What is the principal difference?
>

Who knows what UMA does... but if you do it yourself, you know what the
overhead is. :-)


Re: Memory allocation performance

Kris Kennaway-3
In reply to this post by Alexander Motin-3
Alexander Motin wrote:

> Kris Kennaway writes:
>> Alexander Motin wrote:
>>> Alexander Motin writes:
>>>> While profiling netgraph operation on UP HEAD router I have found
>>>> that huge amount of time it spent on memory allocation/deallocation:
>>>
>>> I have forgotten to tell that it was mostly GENERIC kernel just built
>>> without INVARIANTS, WITNESS and SMP but with 'profile 2'.
>>
>> What is 'profile 2'?
>
> I have thought it is high resolution profiling support. Isn't it?
>

OK.  This is not commonly used, so I don't know if it works.  Try using
hwpmc if possible to compare.

When you say that your own allocation routines take less time under
profiling, how do they affect the actual system performance?

Kris


Re: Memory allocation performance

Robert N. M. Watson-2
In reply to this post by Alexander Motin-3

On Fri, 1 Feb 2008, Alexander Motin wrote:

> That was actually my second question. As there is only 512 items by default
> and they are small in size I can easily preallocate them all on boot. But is
> it a good way? Why UMA can't do just the same when I have created zone with
> specified element size and maximum number of objects? What is the principal
> difference?

Alexander,

I think we should drill down in the analysis a bit and see if we can figure
out what's going on with UMA.  What UMA essentially does is ask the VM for
pages, and then pack objects into pages.  It maintains some meta-data, and
depending on the relative sizes of objects and pages, it may store it in the
page or potentially elsewhere.  Either way, it looks very much like an array of
struct object.  It has a few extra layers of wrapping in order to maintain
stats, per-CPU caches, object life cycle, etc.  When INVARIANTS is turned off,
allocation from the per-CPU cache consists of pulling objects in and out of
one of two per-CPU queues.  So I guess the question is: where are the cycles
going?  Are we suffering excessive cache misses in managing the slabs?  Are
you effectively "cycling through" objects rather than using a smaller set that
fits better in the cache?  Is some bit of debugging enabled that shouldn't be,
perhaps due to a failure of ifdefs?

BTW, UMA does let you set the size of buckets, so you can try tuning the
bucket size.  For starters, try setting the zone flag UMA_ZONE_MAXBUCKET.
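
For illustration only, a hypothetical zone set up that way might look like
the following; the zone name, item size and limit are made up, and the exact
uma_zcreate() prototype should be checked against vm/uma.h on the tree in
question:

#include <sys/param.h>
#include <vm/uma.h>

static uma_zone_t example_zone;

/* Called once from the subsystem's own initialization path. */
static void
example_zone_setup(void)
{
        example_zone = uma_zcreate("example items", 128,  /* name, item size */
            NULL, NULL, NULL, NULL,        /* no ctor/dtor/uminit/fini */
            UMA_ALIGN_PTR,                 /* pointer alignment */
            UMA_ZONE_MAXBUCKET);           /* use the largest per-CPU buckets */
        uma_zone_set_max(example_zone, 512);    /* cap the zone at 512 items */
}

uma_zalloc(example_zone, M_NOWAIT) and uma_zfree(example_zone, item) then go
through the per-CPU buckets as described above.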

It would be very helpful if you could try doing some analysis with hwpmc --
"high resolution profiling" is of increasingly limited utility with modern
CPUs, where even a high frequency timer won't run very often.  It's also quite
subject to cycle events that align with other timers in the system.

Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Memory allocation performance

Alexander Motin-3
Hi.

Robert Watson wrote:
> It would be very helpful if you could try doing some analysis with hwpmc
> -- "high resolution profiling" is of increasingly limited utility with
> modern CPUs, where even a high frequency timer won't run very often.  
> It's also quite subject to cycle events that align with other timers in
> the system.

I have tried hwpmc but am still not completely comfortable with it. The
overall picture is somewhat similar to kgmon's, but it looks very noisy. Is
there some "know how" about how to use it better?

I have tried it for measuring the number of instructions. But I doubt
that instruction count is the right counter for performance measurement, as
different instructions may have very different execution times depending
on many factors, like cache misses and current memory traffic. I have
tried to use tsc to count CPU cycles, but got the error:
# pmcstat -n 10000 -S "tsc" -O sample.out
pmcstat: ERROR: Cannot allocate system-mode pmc with specification
"tsc": Operation not supported
What have I missed?

I am now using a Pentium 4 Prescott CPU with HTT enabled in the BIOS, but
with the kernel built without SMP to simplify profiling. What counters can
you recommend for regular time profiling on it?

Thanks for reply.

--
Alexander Motin

Re: Memory allocation performance

Bruce Evans-4
On Fri, 1 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> It would be very helpful if you could try doing some analysis with hwpmc --
>> "high resolution profiling" is of increasingly limited utility with modern

You mean "of increasingly greater utility with modern CPUs".  Low resolution
kernel profiling stopped giving enough resolution in about 1990, and has
become of increasingly limited utility since then, but high resolution
kernel profiling uses the TSC or possibly a performance counter so it
has kept up with CPU speedups.  Cache effects and out of order execution
are larger now, but they affect all types of profiling and still not too
bad with high resulotion kernel profiling.  High resolution kernel profiling
doesn't really work with SMP, but that is not Alexander's problem since he
profiled under UP.

>> CPUs, where even a high frequency timer won't run very often.  It's also
>> quite subject to cycle events that align with other timers in the system.

No, it isn't affected by either of these.  The TSC is incremented on
every CPU cycle, and the performance counters are incremented on every
event.  It is completely unaffected by other timers.

> I have tried hwpmc but still not completely friendly with it. Whole picture
> is somewhat alike to kgmon's, but it looks very noisy. Is there some "know
> how" about how to use it better?

hwpmc doesn't work for me either.  I can't see how it could work as well
as high resolution kernel profiling for events at the single function
level, since it is statistics-based.  With the statistics clock interrupt
rate fairly limited, it just cannot get enough resolution over short runs.
Also, it works poorly for me (with a current kernel and ~5.2 userland
except for some utilities like pmc*).  Generation of profiles stopped
working for me, and it often fails with allocation errors.

> I have tried it for measuring number of instructions. But I am in doubt that
> instructions is a correct counter for performance measurement as different
> instructions may have very different execution times depending on many
> reasons, like cache misses and current memory traffic.

Cycle counts are more useful, but high resolution kernel profiling can do
this too, with some fixes:
- update perfmon for newer CPUs.  It is broken even for Athlons (takes a
   2 line fix, or more lines with proper #defines and if()s).
- select the performance counter to be used for profiling using
   sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the
   performance counter for the instruction count (or any).  I use hwpmc
   mainly to determine N :-).  It may also be necessary to change the
   kernel variable cpu_clock_pmc_conf.  Configuration of this is unfinished.
- use high resolution kernel profiling normally.  Note that switching to
   a perfmon counter is only permitted for !SMP (since it is too unsupported
   under SMP to do more than panic if permitted there), and that the
   switch loses the calibration of profiling.  Profiling normally
   compensates for overheads of the profiling itself, and the compensation
   would work almost perfectly for event counters, unlike for time-related
   counters, since the extra events for profiling aren't much affected by
   caches.

> I have tried to use
> tsc to count CPU cycles, but got the error:
> # pmcstat -n 10000 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification "tsc":
> Operation not supported
> What have I missed?

This might be just because the TSC really is not supported.  Many things
require an APIC for hwpmc to support them.

I get allocation errors like this for operations that work a few times
before failing.

> I am now using Pentium4 Prescott CPU with HTT enabled in BIOS, but kernel
> built without SMP to simplify profiling. What counters can you recommend me
> to use on it for regular time profiling?

Try them all :-).  From userland first with an overall count, since looking
at the details in gprof output takes too long (and doesn't work for me with
hwpmc anyway).  I use scripts like the following to try them all from
userland:

runpm:
%%%
c="ttcp -n100000 -l5 -u -t epsplex"

ctr=0
while test $ctr -lt 256
do
  ctr1=$(printf "0x%02x\n" $ctr)
  case $ctr1 in
  0x00) src=k8-fp-dispatched-fpu-ops;;
  0x01) src=k8-fp-cycles-with-no-fpu-ops-retired;;
  0x02) src=k8-fp-dispatched-fpu-fast-flag-ops;;
  0x05) src=k8-fp-unknown-$ctr1;;
  0x09) src=k8-fp-unknown-$ctr1;;
  0x0d) src=k8-fp-unknown-$ctr1;;
  0x11) src=k8-fp-unknown-$ctr1;;
  0x15) src=k8-fp-unknown-$ctr1;;
  0x19) src=k8-fp-unknown-$ctr1;;
  0x1d) src=k8-fp-unknown-$ctr1;;
  0x20) src=k8-ls-segment-register-load;; # XXX
  0x21) src=kx-ls-microarchitectural-resync-by-self-mod-code;;
  0x22) src=k8-ls-microarchitectural-resync-by-snoop;;
  0x23) src=kx-ls-buffer2-full;;
  0x24) src=k8-ls-locked-operation;; # XXX
  0x25) src=k8-ls-microarchitectural-late-cancel;;
  0x26) src=kx-ls-retired-cflush-instructions;;
  0x27) src=kx-ls-retired-cpuid-instructions;;
  0x2a) src=kx-ls-unknown-$ctr1;;
  0x2b) src=kx-ls-unknown-$ctr1;;
  0x2e) src=k7-ls-unknown-$ctr1;;
  0x2f) src=k7-ls-unknown-$ctr1;;
  0x32) src=kx-ls-unknown-$ctr1;;
  0x33) src=kx-ls-unknown-$ctr1;;
  0x36) src=k7-ls-unknown-$ctr1;;
  0x37) src=k7-ls-unknown-$ctr1;;
  0x3a) src=kx-ls-unknown-$ctr1;;
  0x3b) src=kx-ls-unknown-$ctr1;;
  0x3e) src=k7-ls-unknown-$ctr1;;
  0x3f) src=k7-ls-unknown-$ctr1;;
  0x40) src=kx-dc-accesses;;
  0x41) src=kx-dc-misses;;
  0x42) src=kx-dc-refills-from-l2;; # XXX
  0x43) src=kx-dc-refills-from-system;; # XXX
  0x44) src=kx-dc-writebacks;; # XXX
  0x45) src=kx-dc-l1-dtlb-miss-and-l2-dtlb-hits;;
  0x46) src=kx-dc-l1-and-l2-dtlb-misses;;
  0x47) src=kx-dc-misaligned-references;;
  0x48) src=kx-dc-microarchitectural-late-cancel-of-an-access;;
  0x49) src=kx-dc-microarchitectural-early-cancel-of-an-access;;
  0x4a) src=k8-dc-one-bit-ecc-error;;
  0x4b) src=k8-dc-dispatched-prefetch-instructions;;
  0x4c) src=k8-dc-dcache-accesses-by-locks;;
  0x4d) src=k7-dc-unknown-$ctr1;;
  0x4e) src=k7-dc-unknown-$ctr1;;
  0x4f) src=k7-dc-unknown-$ctr1;;
  0x50) src=kx-dc-unknown-$ctr1;;
  0x51) src=kx-dc-unknown-$ctr1;;
  0x55) src=kx-dc-unknown-$ctr1;;
  0x56) src=kx-dc-unknown-$ctr1;;
  0x57) src=kx-dc-unknown-$ctr1;;
  0x58) src=kx-dc-unknown-$ctr1;;
  0x59) src=kx-dc-unknown-$ctr1;;
  0x5d) src=k7-dc-unknown-$ctr1;;
  0x5e) src=k7-dc-unknown-$ctr1;;
  0x5f) src=k7-dc-unknown-$ctr1;;
  0x64) src=kx-bu-unknown-$ctr1;;
  0x68) src=kx-bu-unknown-$ctr1;;
  0x69) src=kx-bu-unknown-$ctr1;;
  0x76) src=kx-bu-cpu-clk-unhalted;;
  0x79) src=k8-bu-unknown-$ctr1;;
  0x7d) src=k8-bu-internal-l2-request;; # XXX
  0x7e) src=k8-bu-fill-request-l2-miss;; # XXX
  0x7f) src=k8-bu-fill-into-l2;; # XXX
  0x80) src=kx-ic-fetches;;
  0x81) src=kx-ic-misses;;
  0x82) src=kx-ic-refill-from-l2;;
  0x83) src=kx-ic-refill-from-system;;
  0x84) src=kx-ic-l1-itlb-misses;;
  0x85) src=kx-ic-l1-l2-itlb-misses;;
  0x86) src=k8-ic-microarchitectural-resync-by-snoop;;
  0x87) src=kx-ic-instruction-fetch-stall;;
  0x88) src=kx-ic-return-stack-hit;;
  0x89) src=kx-ic-return-stack-overflow;;
  0xc0) src=kx-fr-retired-instructions;;
  0xc1) src=kx-fr-retired-ops;;
  0xc2) src=kx-fr-retired-branches;;
  0xc3) src=kx-fr-retired-branches-mispredicted;;
  0xc4) src=kx-fr-retired-taken-branches;;
  0xc5) src=kx-fr-retired-taken-branches-mispredicted;;
  0xc6) src=kx-fr-retired-far-control-transfers;;
  0xc7) src=kx-fr-retired-resync-branches;;
  0xc8) src=kx-fr-retired-near-returns;;
  0xc9) src=kx-fr-retired-near-returns-mispredicted;;
  0xca) src=kx-fr-retired-taken-branches-mispred-by-addr-miscompare;;
  0xcb) src=k8-fr-retired-fpu-instructions;;
  0xcc) src=k8-fr-retired-fastpath-double-op-instructions;;
  0xcd) src=kx-fr-interrupts-masked-cycles;;
  0xce) src=kx-fr-interrupts-masked-while-pending-cycles;;
  0xcf) src=kx-fr-hardware-interrupts;;
  0xd0) src=kx-fr-decoder-empty;;
  0xd1) src=kx-fr-dispatch-stalls;;
  0xd2) src=kx-fr-dispatch-stall-from-branch-abort-to-retire;;
  0xd3) src=kx-fr-dispatch-stall-for-serialization;;
  0xd4) src=kx-fr-dispatch-stall-for-segment-load;;
  0xd5) src=kx-fr-dispatch-stall-when-reorder-buffer-is-full;;
  0xd6) src=kx-fr-dispatch-stall-when-reservation-stations-are-full;;
  0xd7) src=kx-fr-dispatch-stall-when-fpu-is-full;;
  0xd8) src=kx-fr-dispatch-stall-when-ls-is-full;;
  0xd9) src=kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet;;
  0xda) src=kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending;;
  0xdb) src=k8-fr-fpu-exceptions;;
  0xdc) src=k8-fr-number-of-breakpoints-for-dr0;;
  0xdd) src=k8-fr-number-of-breakpoints-for-dr1;;
  0xde) src=k8-fr-number-of-breakpoints-for-dr2;;
  0xdf) src=k8-fr-number-of-breakpoints-for-dr3;;
  0xe0) src=k8-nb-memory-controller-page-access-event;;
  0xe1) src=k8-nb-memory-controller-page-table-overflow;;
  0xe2) src=k8-nb-memory-controller-dram-slots-missed;;
  0xe3) src=k8-nb-memory-controller-turnaround;;
  0xe4) src=k8-nb-memory-controller-bypass-saturation;; # XXX
  0xe5) src=k8-nb-sized-commands;; # XXX
  0xec) src=k8-nb-probe-result;; # XXX
  0xf6) src=k8-nb-ht-bus0-bandwidth;;
  0xf7) src=k8-nb-ht-bus1-bandwidth;;
  0xf8) src=k8-nb-ht-bus2-bandwidth;;
  0xfc) src=k8-nb-unknown-$ctr1;;
  *) src=very-unknown-$ctr1;;
  esac
  case $src in
  k8-*) ctr=$(($ctr + 1)); continue;;
  *unknown-*) ctr=$(($ctr + 1)); continue;;
  esac
  echo "# s/$src "
  perfmon -c "$c" -ou -l 1 $ctr |
     egrep -v '(^total: |^mean: |^clocks \(at)' | sed -e 's/1: //'
  ctr=$(($ctr + 1))
done
%%%

runpmc:
%%%
for i in \
  k8-fp-dispatched-fpu-ops \
  k8-fp-cycles-with-no-fpu-ops-retired \
  k8-fp-dispatched-fpu-fast-flag-ops \
  k8-ls-segment-register-load \
  k8-ls-microarchitectural-resync-by-self-modifying-code \
  k8-ls-microarchitectural-resync-by-snoop \
  k8-ls-buffer2-full \
  k8-ls-locked-operation \
  k8-ls-microarchitectural-late-cancel \
  k8-ls-retired-cflush-instructions \
  k8-ls-retired-cpuid-instructions \
  k8-dc-access \
  k8-dc-miss \
  k8-dc-refill-from-l2 \
  k8-dc-refill-from-system \
  k8-dc-copyback \
  k8-dc-l1-dtlb-miss-and-l2-dtlb-hit \
  k8-dc-l1-dtlb-miss-and-l2-dtlb-miss \
  k8-dc-misaligned-data-reference \
  k8-dc-microarchitectural-late-cancel-of-an-access \
  k8-dc-microarchitectural-early-cancel-of-an-access \
  k8-dc-one-bit-ecc-error \
  k8-dc-dispatched-prefetch-instructions \
  k8-dc-dcache-accesses-by-locks \
  k8-bu-cpu-clk-unhalted \
  k8-bu-internal-l2-request \
  k8-bu-fill-request-l2-miss \
  k8-bu-fill-into-l2 \
  k8-ic-fetch \
  k8-ic-miss \
  k8-ic-refill-from-l2 \
  k8-ic-refill-from-system \
  k8-ic-l1-itlb-miss-and-l2-itlb-hit \
  k8-ic-l1-itlb-miss-and-l2-itlb-miss \
  k8-ic-microarchitectural-resync-by-snoop \
  k8-ic-instruction-fetch-stall \
  k8-ic-return-stack-hit \
  k8-ic-return-stack-overflow \
  k8-fr-retired-x86-instructions \
  k8-fr-retired-uops \
  k8-fr-retired-branches \
  k8-fr-retired-branches-mispredicted \
  k8-fr-retired-taken-branches \
  k8-fr-retired-taken-branches-mispredicted \
  k8-fr-retired-far-control-transfers \
  k8-fr-retired-resyncs \
  k8-fr-retired-near-returns \
  k8-fr-retired-near-returns-mispredicted \
  k8-fr-retired-taken-branches-mispredicted-by-addr-miscompare \
  k8-fr-retired-fpu-instructions \
  k8-fr-retired-fastpath-double-op-instructions \
  k8-fr-interrupts-masked-cycles \
  k8-fr-interrupts-masked-while-pending-cycles \
  k8-fr-taken-hardware-interrupts \
  k8-fr-decoder-empty \
  k8-fr-dispatch-stalls \
  k8-fr-dispatch-stall-from-branch-abort-to-retire \
  k8-fr-dispatch-stall-for-serialization \
  k8-fr-dispatch-stall-for-segment-load \
  k8-fr-dispatch-stall-when-reorder-buffer-is-full \
  k8-fr-dispatch-stall-when-reservation-stations-are-full \
  k8-fr-dispatch-stall-when-fpu-is-full \
  k8-fr-dispatch-stall-when-ls-is-full \
  k8-fr-dispatch-stall-when-waiting-for-all-to-be-quiet \
  k8-fr-dispatch-stall-when-far-xfer-or-resync-branch-pending \
  k8-fr-fpu-exceptions \
  k8-fr-number-of-breakpoints-for-dr0 \
  k8-fr-number-of-breakpoints-for-dr1 \
  k8-fr-number-of-breakpoints-for-dr2 \
  k8-fr-number-of-breakpoints-for-dr3 \
  k8-nb-memory-controller-page-table-overflow \
  k8-nb-memory-controller-dram-slots-missed \
  k8-nb-memory-controller-bypass-saturation \
  k8-nb-sized-commands \
  k8-nb-probe-result

do
  pmcstat -s $i sleep 1 2>&1 | sed -e 's/^ *//' -e 's/  */ /' \
     -e 's/ *$//' -e 's/\/00\/k8-/\/k8-/'
done
%%%

runpmc7:
%%%
for i in \
  k7-dc-accesses \
  k7-dc-misses \
  k7-dc-refills-from-l2 \
  k7-dc-refills-from-system \
  k7-dc-writebacks \
  k7-l1-dtlb-miss-and-l2-dtlb-hits \
  k7-l1-and-l2-dtlb-misses \
  k7-misaligned-references \
  k7-ic-fetches \
  k7-ic-misses \
  k7-l1-itlb-misses \
  k7-l1-l2-itlb-misses \
  k7-retired-instructions \
  k7-retired-ops \
  k7-retired-branches \
  k7-retired-branches-mispredicted \
  k7-retired-taken-branches \
  k7-retired-taken-branches-mispredicted \
  k7-retired-far-control-transfers \
  k7-retired-resync-branches \
  k7-interrupts-masked-cycles \
  k7-interrupts-masked-while-pending-cycles \
  k7-hardware-interrupts

do
  pmcstat -s $i sleep 1 2>&1 |
     sed -e 's/^ *//' -e 's/  */ /' -e 's/ *$//' -e 's/k7/kx/'
done
%%%

"runpm" tries up to all 256 perfomance counters, with names like the
hwpmc ones.  k7 means AthlonXP only; k8 means Athlon64 only; kx means
both, but many kx's don't really work or are not documented for both.
A few like kx-fr-retired-near-returns-mispredicted (?) are not documented
for AXP but seem to work and are useful.

runpmc tries the documented A64 counters.  runpmc7 tries the documented
AXP counters.  hwpmc is less useful than perfmon here since it doesn't
support using the undocumented counters.  There is a pmc* option that
prints all the counters in the above lists.  I checked that they are
almost precisely the documented (in Athlon optimization manuals) ones.
Names are unfortunately inconsistent between k7 and k8 in some cases,
following inconsistencies in the documentation.

I don't know anything about Pentium counters except what is in source
code.

gprof output for the mumble perfmon counter (kx-dc-misses?) while sending
100000 tiny packets using ttcp -t looks like this (after fixing the
calibration):

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.81 seconds

   %   cumulative   self              self     total
  time   seconds   seconds    calls  us/call  us/call  name
  11.0      0.308    0.308   100083        3       24  syscall [3]
  10.8      0.613    0.305   200012        2        2  rn_match [16]
   4.4      0.738    0.125   100019        1        1  _bus_dmamap_load_buffer [25]
   4.3      0.859    0.121   300107        0        0  generic_copyin [27]
   4.0      0.973    0.114   100006        1        9  ip_output [10]
   3.8      1.079    0.106   100006        1        4  ether_output [12]
   3.7      1.182    0.103   100007        1        1  fgetsock [30]
   3.6      1.284    0.102   100006        1        2  bus_dmamap_load_mbuf [21]
   3.6      1.385    0.101   200012        1        3  rtalloc_ign [11]
   3.6      1.486    0.101   100083        1       25  Xint0x80_syscall [2]
   3.6      1.587    0.101   200012        1        1  in_clsroute [32]
   3.6      1.688    0.101   100006        1       20  sendto [4]
   3.6      1.789    0.101   100008        1        1  in_pcblookup_hash [33]
   3.6      1.890    0.101   100006        1       16  kern_sendit [6]
   3.6      1.990    0.100   200012        1        2  in_matroute [15]
   3.6      2.091    0.100   100748        1        1  doreti [35]
   3.6      2.191    0.100   100007        1        2  getsockaddr [22]
%%%

I would like to be able to do this with hwpmc but don't see how it can.
Only (non-statistical) counting at every function call and return can
give precise counts like the above.  However, non-statistical counting
is better for some things.

Back to the original problem.  UMA allocation overhead shows up in TSC
profiles of ttcp, but it is just one of too many things that take a while.
There are many function calls, each taking about 1%:

% granularity: each sample hit covers 16 byte(s) for 0.00% of 0.86 seconds
%
%   %   cumulative   self              self     total
%  time   seconds   seconds    calls  ns/call  ns/call  name
%  44.9      0.388    0.388        0  100.00%           mcount [1]
%  20.9      0.569    0.180        0  100.00%           mexitcount [7]
%   8.0      0.638    0.069        0  100.00%           cputime [14]
%   1.8      0.654    0.016        0  100.00%           user [25]
%   1.6      0.668    0.014   100006      143     1051  udp_output [12]
%   1.5      0.681    0.013   100006      133      704  ip_output [13]
%   1.3      0.692    0.011   300120       37       37  copyin [30]
%   1.2      0.702    0.010   100006      100     1360  sosend_dgram [10]
%   0.9      0.710    0.008   200012       39       39  rn_match [38]
%   0.9      0.718    0.007   300034       25       25  memcpy [39]
%   0.8      0.725    0.007   200103       36       58  uma_zalloc_arg [29]
%   0.8      0.732    0.007   100090       68     1977  syscall [4]

All the times seem reasonable.  Without profiling, sendto() and overheads
take about 1700 nsec in -current and about 1600 nsec in my version
of 5.2.  (This is for -current.  The 100 nsec difference is very hard
to understand in detail.)  With high resolution kernel profiling, sendto()
and overheads take about 8600 nsec here.  Profiling has subtracted its
own overheads, and the result of 1977 nsec for syscall is consistent with
syscall() taking a bit less than 1700 nsec when not looked at.  (Profiling
only subtracts its best-case overheads.  Its runtime overheads must be
larger due to cache effects, and if these are very large then we cannot
trust the compensation.  Since it compensated from 8600 down to about 1977,
it has clearly done the compensation almost right.  The compensation is
delicate when there are a lot of functions taking ~20 nsec, since the profiling
overhead per function call is 82 nsec.)

%   0.8      0.738    0.007   200012       33       84  rtalloc1 [24]
%   0.8      0.745    0.006   100006       65      139  bge_encap [26]
%   0.7      0.751    0.006   100006       62      201  bge_start_locked [20]
%   0.6      0.757    0.006   200075       28       28  bzero [44]
%   0.6      0.761    0.005   100006       48      467  ether_output [15]
%   0.6      0.766    0.005   100006       48      192  m_uiotombuf [22]
%   0.5      0.771    0.005   200100       23       45  uma_zfree_arg [33]
%   0.5      0.775    0.005   100006       46       46  bus_dmamap_load_mbuf_sg [45]
%   0.5      0.780    0.004   100028       45      132  malloc [27]
%   0.5      0.784    0.004   200012       20      104  rtalloc_ign [19]

This is hard to optimize.

uma has shown up as taking 58 nsec for uma_zalloc_arg() (including what it
calls) and 45 nsec for uma_zfree_arg().  This is on a 2.2GHz A64.  Anything
larger than that might be a profiling error.  But the allocations here are
tiny -- maybe large allocations cause cache misses.

I was able to optimize away most of the allocation overheads in sendto()
by allocating the sockaddr on the stack, but this made little difference
overall.  (It reduces dynamic allocations per packet from 2 to 1.  Both
allocations use malloc() so they are a bit slower than pure uma.  BTW,
switching from mbuf-based allocation to malloc() in getsockaddr() etc.
long ago cost 10 usec on a P1/133.  A loss of 10000 nsec makes the overhead
of 200 nsec for malloc now look tiny.)

Remember I said that differences of 100 nsec are hard to understand?
It is also not easy to understand why eliminating potentially 100 nsec
of malloc() overhead makes almost no difference overall.  The 100 nsec
gets distributed differently, or maybe the profiling really was wrong
for the malloc() part.

Reads of the TSC are executed possibly out of order on some CPUs.  This
doesn't seem to have much effect on the accuracy of high resolution
(TSC) kernel profiling on at least Athlons.  rdtsc takes only 12 cycles
on AXP-A64.  I think it takes much longer on Pentiums.  On Phenom it
takes ~42 cycles (pessimized to share it across CPUs :-().  With it
taking much longer than the functions that it profiles, the compensation
might become too fragile.

Bruce

Re: Memory allocation performance

Alexander Motin-3
In reply to this post by Robert N. M. Watson-2
Robert Watson wrote:
> I guess the question is: where are the cycles going?  Are we suffering
> excessive cache misses in managing the slabs?  Are you effectively
> "cycling through" objects rather than using a smaller set that fits
> better in the cache?

In my test setup usually only several objects from the zone are allocated
at the same time, but they are allocated twice per packet.

To check the UMA dependency I have made a trivial one-element cache which in
my test case allows avoiding two of the four allocations per packet.
.....alloc.....
-       item = uma_zalloc(ng_qzone, wait | M_ZERO);
+       mtx_lock_spin(&itemcachemtx);
+       item = itemcache;
+       itemcache = NULL;
+       mtx_unlock_spin(&itemcachemtx);
+       if (item == NULL)
+               item = uma_zalloc(ng_qzone, wait | M_ZERO);
+       else
+               bzero(item, sizeof(*item));
.....free.....
-       uma_zfree(ng_qzone, item);
+       mtx_lock_spin(&itemcachemtx);
+       if (itemcache == NULL) {
+               itemcache = item;
+               item = NULL;
+       }
+       mtx_unlock_spin(&itemcachemtx);
+       if (item)
+               uma_zfree(ng_qzone, item);
...............

To be sure that the test system is CPU-bound, I have throttled it with sysctl
to 1044MHz. With this patch my test PPPoE-to-PPPoE router throughput has
grown from 17 to 21 Mbytes/s. The profiling results I sent earlier promised
results close to this.

> Is some bit of debugging enabled that shouldn't
> be, perhaps due to a failure of ifdefs?

I have commented out all INVARIANTS and WITNESS options from the GENERIC
kernel config. What else should I check?

--
Alexander Motin

Re: Memory allocation performance

Robert N. M. Watson-2

On Sat, 2 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> I guess the question is: where are the cycles going?  Are we suffering
>> excessive cache misses in managing the slabs?  Are you effectively "cycling
>> through" objects rather than using a smaller set that fits better in the
>> cache?
>
> In my test setup only several objects from zone usually allocated same time,
> but they allocated two times per every packet.
>
> To check UMA dependency I have made a trivial one-element cache which in my
> test case allows to avoid two for four allocations per packet.

Avoiding unnecessary allocations is a good general principle, but duplicating
cache logic is a bad idea.  If you're able to structure the below without
using locking, it strikes me you'd do much better, especially if it's in a
single processing pass.  Can you not use a per-thread/stack/session variable
to avoid that?

> .....alloc.....
> -       item = uma_zalloc(ng_qzone, wait | M_ZERO);
> +       mtx_lock_spin(&itemcachemtx);
> +       item = itemcache;
> +       itemcache = NULL;
> +       mtx_unlock_spin(&itemcachemtx);

Why are you using spin locks?  They are quite a bit more expensive on several
hardware platforms, and any environment it's safe to call uma_zalloc() from
will be equally safe to use regular mutexes from (i.e., mutex-sleepable).
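
For example, the allocation side of the same cache with a default (sleep)
mutex would look roughly like this; the names follow your patch, and it
assumes the code sits in ng_base.c, where ng_qzone, the item_p type and the
needed headers are already available:

static struct mtx itemcachemtx;
static item_p itemcache;

/* Once, at netgraph setup time: a default mutex instead of a spin mutex. */
static void
itemcache_init(void)
{
        mtx_init(&itemcachemtx, "ng_item cache", NULL, MTX_DEF);
}

/* Alloc path: mtx_lock()/mtx_unlock() replace the _spin variants. */
static item_p
itemcache_alloc(int wait)
{
        item_p item;

        mtx_lock(&itemcachemtx);
        item = itemcache;
        itemcache = NULL;
        mtx_unlock(&itemcachemtx);
        if (item == NULL)
                item = uma_zalloc(ng_qzone, wait | M_ZERO);
        else
                bzero(item, sizeof(*item));
        return (item);
}

The free path is the same substitution: plain mtx_lock()/mtx_unlock() around
the itemcache test, then uma_zfree() for the overflow case.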

> +       if (item == NULL)
> +               item = uma_zalloc(ng_qzone, wait | M_ZERO);
> +       else
> +               bzero(item, sizeof(*item));
> .....free.....
> -       uma_zfree(ng_qzone, item);
> +       mtx_lock_spin(&itemcachemtx);
> +       if (itemcache == NULL) {
> +               itemcache = item;
> +               item = NULL;
> +       }
> +       mtx_unlock_spin(&itemcachemtx);
> +       if (item)
> +               uma_zfree(ng_qzone, item);
> ...............
>
> To be sure that test system is CPU-bound I have throttled it with sysctl to
> 1044MHz. With this patch my test PPPoE-to-PPPoE router throughput has grown
> from 17 to 21Mbytes/s. Profiling results I have sent promised close results.
>
>> Is some bit of debugging enabled that shouldn't be, perhaps due to a
>> failure of ifdefs?
>
> I have commented out all INVARIANTS and WITNESS options from GENERIC kernel
> config. What else should I check?

Hence my request for drilling down a bit on profiling -- the question I'm
asking is whether profiling shows things running or taking time that shouldn't
be.

Robert N M Watson
Computer Laboratory
University of Cambridge

Re: Memory allocation performance

Joseph Koshy
In reply to this post by Alexander Motin-3
> I have tried it for measuring number of instructions. But I am in doubt
> that instructions is a correct counter for performance measurement as
> different instructions may have very different execution times depending
> on many reasons, like cache misses and current memory traffic. I have
> tried to use tsc to count CPU cycles, but got the error:
> # pmcstat -n 10000 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification
> "tsc": Operation not supported
> What have I missed?

You cannot sample with the TSC since the TSC does not interrupt the CPU.
For CPU cycles you would probably want to use "p4-global-power-events";
see pmc(3).
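
For example, substituting that event into the command tried earlier would
give something like

  pmcstat -n 10000 -S p4-global-power-events -O sample.out

with the -n sampling count adjusted to suit a cycle-rate event; the flags
are just the ones already used above.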

Regards,
Koshy

Re: Memory allocation performance

Alexander Motin-3
Joseph Koshy wrote:
> You cannot sample with the TSC since the TSC does not interrupt the CPU.
> For CPU cycles you would probably want to use "p4-global-power-events";
> see pmc(3).

Thanks, I have already found this. The only problem was that by
default it counts cycles only when both logical cores are active, while
one of my cores was halted.
Sampling on this, the profiler showed results close to the usual profiling,
but looking more random:

          175.97     1.49       1/64          ip_input <cycle 1> [49]
          175.97     1.49       1/64          g_alloc_bio [81]
          175.97     1.49       1/64          ng_package_data [18]
         1055.81     8.93       6/64          em_handle_rxtx [4]
         2639.53    22.32      15/64          em_get_buf [19]
         3343.41    28.27      19/64          ng_getqblk [17]
         3695.34    31.25      21/64          ip_forward <cycle 1> [14]
[9]21.6 11262.00   95.23      64         uma_zalloc_arg [9]
           35.45    13.03       5/22          critical_exit [75]
           26.86     0.00      22/77          critical_enter [99]
           19.89     0.00      18/19          mb_ctor_mbuf [141]


           31.87     0.24       4/1324        ng_ether_rcvdata [13]
           31.87     0.24       4/1324        ip_forward <cycle 1> [14]
           95.60     0.73      12/1324        ng_iface_rcvdata <cycle 1> [16]
          103.57     0.79      13/1324        m_freem [25]
          876.34     6.71     110/1324        mb_free_ext [30]
         9408.75    72.01    1181/1324        ng_free_item [11]
[10]20.2 10548.00  80.73    1324         uma_zfree_arg [10]
           26.86     0.00      22/77          critical_enter [99]
           15.00    11.59       7/7           mb_dtor_mbuf [134]
           19.00     6.62       4/4           mb_dtor_pack [136]
            1.66     0.00       1/32          m_tag_delete_chain [114]


  21.4   11262.00 11262.00       64 175968.75 177456.76  uma_zalloc_arg [9]
  20.1   21810.00 10548.00     1324  7966.77  8027.74  uma_zfree_arg [10]
   5.6   24773.00  2963.00     1591  1862.35  2640.07  ng_snd_item <cycle 1> [15]
   3.5   26599.00  1826.00       33 55333.33 55333.33  ng_address_hook [20]
   2.4   27834.00  1235.00      319  3871.47  3871.47  ng_acquire_read [28]

To make the statistics better I need to record sampling data with a smaller
period, but too much data creates additional overhead, including disk
operations, and breaks the statistics. Is there any way to make it more
precise? What sampling parameters should I use for better results?

--
Alexander Motin

Re: Memory allocation performance

Joseph Koshy
> Thanks, I have already found this. There was only problem, that by
> default it counts cycles only when both logical cores are active while
> one of my cores was halted.

Did you try the 'active' event modifier: "p4-global-power-events,active=any"?

> Sampling on this, profiler shown results close to usual profiling, but
> looking more random:

Adding '-fno-omit-frame-pointer' to CFLAGS may help hwpmc to capture
callchains better.

Regards,
Koshy

Re: Memory allocation performance

Peter Jeremy
In reply to this post by Alexander Motin-3
On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>To check UMA dependency I have made a trivial one-element cache which in my
>test case allows to avoid two for four allocations per packet.

You should be able to implement this lockless using atomic(9).  I haven't
verified it, but the following should work.

>.....alloc.....
>-       item = uma_zalloc(ng_qzone, wait | M_ZERO);

>+       mtx_lock_spin(&itemcachemtx);
>+       item = itemcache;
>+       itemcache = NULL;
>+       mtx_unlock_spin(&itemcachemtx);
 = item = atomic_readandclear_ptr(&itemcache);

>+       if (item == NULL)
>+               item = uma_zalloc(ng_qzone, wait | M_ZERO);
>+       else
>+               bzero(item, sizeof(*item));

>.....free.....
>-       uma_zfree(ng_qzone, item);

>+       mtx_lock_spin(&itemcachemtx);
>+       if (itemcache == NULL) {
>+               itemcache = item;
>+               item = NULL;
>+       }
>+       mtx_unlock_spin(&itemcachemtx);
>+       if (item)
>+               uma_zfree(ng_qzone, item);
 = if (atomic_cmpset_ptr(&itemcache, NULL, item) == 0)
 = uma_zfree(ng_qzone, item);
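
Putting the two fragments together, a sketch of the lockless cache might look
like the code below. It assumes the code lives in ng_base.c (ng_qzone and
item_p already available), turns itemcache into a volatile uintptr_t since
the _ptr atomics operate on uintptr_t, and the exact atomic(9) prototypes
should be checked against the tree in use:

#include <machine/atomic.h>

static volatile uintptr_t itemcache;    /* 0 == empty slot */

static item_p
itemcache_alloc(int wait)
{
        item_p item;

        /* Atomically take whatever is in the one-element cache, if anything. */
        item = (item_p)atomic_readandclear_ptr(&itemcache);
        if (item == NULL)
                item = uma_zalloc(ng_qzone, wait | M_ZERO);
        else
                bzero(item, sizeof(*item));
        return (item);
}

static void
itemcache_free(item_p item)
{
        /* Stash the item if the slot is empty; otherwise hand it back to UMA. */
        if (atomic_cmpset_ptr(&itemcache, 0, (uintptr_t)item) == 0)
                uma_zfree(ng_qzone, item);
}

Whether the plain (non-barrier) variants are sufficient here is worth
checking against atomic(9).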

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.

Re: Memory allocation performance

Alexander Motin-3
Peter Jeremy writes:
> On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>> To check UMA dependency I have made a trivial one-element cache which in my
>> test case allows to avoid two for four allocations per packet.
>
> You should be able to implement this lockless using atomic(9).  I haven't
> verified it, but the following should work.

I have tried this, but man 9 atomic says:

The atomic_readandclear() functions are not implemented for the types
``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any
variants with memory barriers at this time.

--
Alexander Motin
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: Memory allocation performance

Peter Jeremy
On Sat, Feb 02, 2008 at 09:56:42PM +0200, Alexander Motin wrote:

>Peter Jeremy writes:
>> On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>>> To check UMA dependency I have made a trivial one-element cache which in
>>> my test case allows to avoid two for four allocations per packet.
>>
>> You should be able to implement this lockless using atomic(9).  I haven't
>> verified it, but the following should work.
>
>I have tried this, but man 9 atomic talks:
>
>The atomic_readandclear() functions are not implemented for the types
>``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any
>variants with memory barriers at this time.
Hmmm.  This seems to be more a documentation bug than missing code:
atomic_readandclear_ptr() seems to be implemented on most
architectures (the only one where I can't find it is arm) and is
already used in malloc(3).

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.
