hwpmc granularity and 6.4 network performance


hwpmc granularity and 6.4 network performance

Vadim Goncharov
Hi!

I've recently performed an upgrade of a busy production router from 6.2 to
6.4-PRE. I added two lines to my kernel config and did the usual make
buildkernel:

device          hwpmc                   # Driver (also a loadable module)
options         HWPMC_HOOKS             # Other necessary kernel hooks
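
(The driver could also have been left out of the config and loaded at boot
instead -- I believe a line like the following in /boot/loader.conf is
enough, as long as HWPMC_HOOKS stays compiled into the kernel:

hwpmc_load="YES"

but I preferred to compile it in.)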

After rebooting with the new world and kernel, I noticed that CPU load has
increased slightly (not measured precisely, as it varies every second anyway
since users do not generate steady traffic), and in top -S the 'swi1: net'
thread now often shows up in the *Giant state, which it never did on 6.2,
even though the kernel config did not change much and device polling is
still in use. What could cause this?

Another question: I've read the "Sixty second HWPMC howto" and tried to find
out what exactly eats my CPU. By the way, those instructions did not apply
exactly on my machine; this is what I did:

# cd /tmp
# pmcstat -S instructions -O /tmp/sample.out
# pmcstat -R /tmp/sample.out -k /boot/kernel/kernel -g
# gprof /boot/kernel/kernel p4-instr-retired/kernel.gmon > kernel.gmon.result
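
I stopped the sampling run with ^C after a while; if I read pmcstat(8)
correctly, the sampling window can also be bounded by giving pmcstat a
command to run, so something like this should sample system-wide for about
a minute (untested on my box):

# pmcstat -S instructions -O /tmp/sample.out sleep 60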

Now in file kernel.gmon.result I see the following:

granularity: each sample hit covers 4 byte(s) for 0.00% of 692213.00 seconds

                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                                                     <spontaneous>
[1]     31.7 219129.00        0.00                 ipfw_chk [1]

-----------------------------------------------


[...]

Why does it show 0.00 in this column?

In the next listing, the flat profile, the output is more readable, but the
columns are empty again:

granularity: each sample hit covers 4 byte(s) for 0.00% of 692213.00 seconds

  %   cumulative   self              self     total          
 time   seconds   seconds    calls  ms/call  ms/call  name    
 31.7  219129.00 219129.00                             ipfw_chk [1]
 10.4  291179.00 72050.00                             bcmp [2]
  6.1  333726.00 42547.00                             rn_match [3]
  2.7  352177.00 18451.00                             generic_bzero [4]
  2.4  368960.00 16783.00                             strncmp [5]

OK, I can conclude from this that I should optimize my ipfw ruleset, but
that's all. I know from the sources that ipfw_chk() is a big function with a
bunch of 'case's in a large 'switch', and I want to know which parts of that
switch are executed most often. The listing says the granularity is 4 bytes,
so I assume there is a sample count for each 4-byte chunk of binary code,
which means the information must be there. My kernel is compiled with:

makeoptions     DEBUG=-g

so kgdb knows which instructions correspond to each line of source code.
How can I obtain this information from the profiling data? It would also be
useful to know which places call bcmp() and rn_match().
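
The closest I have come to that so far is mapping raw sample addresses by
hand, roughly like this (a sketch only, with a made-up kernel config name
and address; I use the kernel.debug from the build directory in case the
installed kernel has the debug info stripped):

# pmcstat -R /tmp/sample.out | less
# nm -n /boot/kernel/kernel | grep -A1 ' ipfw_chk$'
# addr2line -f -e /usr/obj/usr/src/sys/MYKERNEL/kernel.debug 0xc0654321

The first command shows the sampled PC values, the second gives the address
range of ipfw_chk, and addr2line (or "info line *0xc0654321" inside kgdb)
maps an address back to a source file and line. That is very tedious by
hand, though, so I hope there is a better way.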

--
WBR, Vadim Goncharov. ICQ#166852181       mailto:[hidden email]
[Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]


Re: hwpmc granularity and 6.4 network performance

Adrian Chadd
A few things!

* Since you've changed two things - hwpmc _AND_ the kernel version -
you can't easily conclude which one (if any!) has any influence on
Giant showing up in your top output. I suggest recompiling without
hwpmc and seeing if the behaviour changes.

* The gprof utility expects something resembling "time" for the
sampling data, but pmcstat doesn't record time, it records "events".
The counts you see in gprof are "events", so change "seconds" to
"events" in your reading of the gprof output.

* I don't know if the pmc backport in 6.4 handles stack call graphs
or not. An easy way to check (a one-liner is sketched below): run
pmcstat -R sample.out | more and see whether you just get "sample"
lines or both "sample" and "callgraph" lines.

* I bet that ipfw_chk is a big enough hint. How big is your ipfw ruleset? :)
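
If you want a quick count rather than eyeballing it, something like this
should do (a rough sketch against the sample file from your commands):

pmcstat -R /tmp/sample.out | grep -c callgraph

Zero "callgraph" records would mean gprof has no caller/callee arcs to work
with, which would also explain the 0.00 in the descendents column. And to
put numbers on the "events" point above: ipfw_chk's 219129.00 is really
219129 sampled instructions out of 692213 total, which is where the 31.7%
comes from.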



Adrian


Re: hwpmc granularity and 6.4 network performance

Vadim Goncharov
Hi Adrian Chadd!

On Tue, 25 Nov 2008 15:09:19 -0500; Adrian Chadd wrote about 'Re: hwpmc granularity and 6.4 network performance':

> * Since you've changed two things - hwpmc _AND_ the kernel version -
> you can't easily conclude which one (if any!) has any influence on
> Giant showing up in your top output. I suggest recompiling without
> hwpmc and seeing if the behaviour changes.

This is not so easy to do whenever I want :) I may check it in a few weeks.

> * The gprof utility expects something resembling "time" for the
> sampling data, but pmcstat doesn't record time, it records "events".
> The counts you see in gprof are "events", so change "seconds" to
> "events" in your reading of the gprof output.

Of course, I know this, but it doesn't change the percentage.

> * I don't know if the backported pmc to 6.4 handles stack call graphs
> or not. Easy way to check - pmcstat -R sample.out | more ; see if you
> just see "sample" lines or "sample" and "callgraph" lines.

No -- no "callgraph" lines, only "sample" ones.

> * I bet that ipfw_chk is a big enough hint. How big is your ipfw ruleset? :)

It's not that big in terms of rule count, and not a very precise hint either,
but it is of course big as a CPU hog :)

router# ipfw show | wc -l
      70

Not that many, right? So I want to see which parts are more CPU-intensive,
to use as a hint when rewriting the ruleset.
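
One thing I am thinking of is collapsing the per-network checks into table
lookups, roughly like this (hypothetical addresses and rule numbers, not
from my real ruleset):

  ipfw table 1 add 10.0.0.0/8
  ipfw table 1 add 172.16.0.0/12
  ipfw add 1000 skipto 5000 ip from 'table(1)' to any

But without knowing which opcodes the samples actually land in, that is
guesswork.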

I've heard about the pmcannotate tool on -arch@, and I think it does exactly
what I want, but it requires a patch for pmcstat that didn't apply on my 6.4;
too much was different :(



--
WBR, Vadim Goncharov. ICQ#166852181       mailto:[hidden email]
[Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]
