Using sysctl(1) to gather resource consumption data

Using sysctl(1) to gather resource consumption data

David Wolfskill
At $work, I've been trying to gather information on "interesting
patterns" of resource consumption during moderately long-running (5 - 8
hour) tasks; the hosts in question usually run FreeBSD 6.2, though
there's an occasional 6.x that's more recent, as well as a bit of
7-STABLE.

I wanted to have a low impact on the system being measured (of course),
and I was unwilling to require that a system to be measured had any
software installed on it other than base FreeBSD.  (Yes, that means I
didn't assume Perl, though in practice in this environment, each does.)

I also wanted the data to be transferred reasonably securely, even if
part of that transit was over facilities over which I had no control.
(Some of the machines being measured happen to be in a continent other
than where I am.)

So I cobbled up a Perl script to run on a data-gathering machine (that
one was mine, so I could require that it had any software I wanted on
it); it acts (if you will) as a "shepherd," watching over child
processes, one of which is created for each host to be measured.

A given child process copies over a shell script to the remote machine,
then redirects STDOUT to append to a file on the data-gathering machine,
and exec()s ssh(1), telling it to run the shell script on the remote
machine.
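
As a rough illustration of the per-host child described above, here is a
minimal sh sketch. It is not the actual script (that child is forked by
a Perl program); the function name, the remote path under /tmp, and the
SCP/SSH override variables are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical sketch of one per-host collector; not the author's script.
# SCP and SSH may be overridden (e.g. for a dry run); defaults are the real tools.
: "${SCP:=scp}"
: "${SSH:=ssh}"

# start_collector HOST LOCAL_SCRIPT OUTFILE [ARGS...]
# Copies LOCAL_SCRIPT to HOST, then replaces the current process with an ssh
# session whose stdout is appended to OUTFILE on the data-gathering machine.
start_collector() {
    host=$1 script=$2 outfile=$3
    shift 3
    remote_path="/tmp/$(basename "$script")"   # assumed remote location

    "$SCP" -q "$script" "$host:$remote_path" || return 1

    # exec: this process becomes ssh, so the parent can later kill it by PID.
    exec "$SSH" "$host" /bin/sh "$remote_path" "$@" >>"$outfile"
}
```

Because of the exec, the function would normally be called from a
forked child or a subshell, e.g. `( start_collector host1 ./gather.sh
data/host1.log 300 ) &`.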

The shell script fabricates a string (depending on the arguments with
which it was invoked), then sits in a loop:

* eval the string
* sleep for the amount of time remaining

indefinitely.  (In practice, the usual nominal time between successive
eval()s is 5 minutes.  I have recently been doing some experiments at a
10-second interval.)
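
The loop above could be sketched along the following lines. This is an
illustration only: the function names are made up, and the optional
COUNT argument is an addition for experimenting (the loop described
above runs indefinitely).

```shell
#!/bin/sh
# Sketch of the eval-then-sleep loop; all names here are hypothetical.

# remaining_sleep INTERVAL START NOW
# Prints how many seconds are left in the current interval, never below 0.
remaining_sleep() {
    left=$(( $1 - ($3 - $2) ))
    [ "$left" -gt 0 ] || left=0
    echo "$left"
}

# sample_loop INTERVAL COMMAND_STRING [COUNT]
# Evaluates COMMAND_STRING once per INTERVAL seconds.
# COUNT defaults to 0, meaning "loop forever".
sample_loop() {
    interval=$1 cmd=$2 count=${3:-0} n=0
    while :; do
        start=$(date +%s)
        eval "$cmd"
        n=$((n + 1))
        [ "$count" -gt 0 ] && [ "$n" -ge "$count" ] && break
        # Sleep only for what is left of the interval after the eval.
        sleep "$(remaining_sleep "$interval" "$start" "$(date +%s)")"
    done
}
```

A call such as `sample_loop 300 'date +%s; sysctl kern.cp_time
vm.loadavg'` would then emit a timestamp and the two OIDs roughly every
5 minutes, regardless of how long the commands themselves take (as long
as they finish within the interval).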

Periodically, back on the data-gathering machine, a couple of different
things happen:

* The "shepherd" script wakes up and checks the mtime on the file for
  each per-host process (to see if it's been updated "sufficiently
  recently").  Actually, it first checks the file that lists the hosts
  to watch; if its mtime has changed, it's re-read, and the list of
  hosts is modified as appropriate.  Anyway, if a given per-host file is
  "too old," the corresponding child process is killed.  Then the
  script runs through the list of hosts that should be checked,
  creating a per-host process for each one for which that's necessary.

  There's a fair amount of detail I'm eliding (such as limited
  exponential backoff for unresponsive hosts).

  In practice, this runs every 2 minutes at the moment.

* There's a cron(8)-initiated make(1) process that runs, reading the
  files created by the per-host processes and writing to a corresponding
  RRD.  (I cobbled up a Perl script to do this.)
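
Two of the shepherd's decisions, the "too old" test and the limited
exponential backoff, come down to simple arithmetic. The sketch below
shows one way to express them in sh; the function names, and the choice
of doubling up to a fixed cap, are assumptions rather than a description
of the actual Perl code.

```shell
#!/bin/sh
# Sketch of two shepherd decisions; names and limits are hypothetical.

# is_stale MTIME NOW MAX_AGE
# Succeeds (exit 0) when a file last modified at MTIME (epoch seconds) is
# more than MAX_AGE seconds old at time NOW.
is_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# backoff_delay FAILURES BASE CAP
# Limited exponential backoff: BASE is doubled once per failure after the
# first, and the result is never allowed to exceed CAP seconds.
# FAILURES of 0 or 1 both yield BASE.
backoff_delay() {
    delay=$2 i=1
    while [ "$i" -lt "$1" ] && [ "$delay" -lt "$3" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    [ "$delay" -le "$3" ] || delay=$3
    echo "$delay"
}
```

On FreeBSD the modification time could be obtained with
`stat -f %m "$file"` and the current time with `date +%s`.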

While I tried to externalize a fair amount of this -- e.g., the list of
sysctl(1) OIDs to use is read from an external file -- it turns out that
certain types of change are a bit ... painful.  In particular, adding a
new "data source" to the RRD qualifies (as "painful").

I recently modified the scripts involved to allow them to also be used
to gather per-NIC statistics (via invocation of "netstat -nibf inet").

I'm about to implement that change over the weekend, so it occurred to
me that this might be a good time to add some more sysctl(1) OIDs.

So I'm asking for suggestions -- ideally, for OIDs that are fairly
easily parseable.  (I started being limited to only OIDs that were
presented as a single numeric value per line, then figured out how to
handle kern.cp_time (which is an ordered quintuple); later I figured out
how to cope with vm.loadavg (which is an ordered triplet ... surrounded by
curly braces).  I don't currently have logic to cope with anything more
complicated than those.)
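
For illustration, the three shapes mentioned above (a single number, a
space-separated tuple such as kern.cp_time, and a brace-wrapped tuple
such as vm.loadavg) can be flattened with a few lines of sh and awk.
This is a sketch, not the parser actually in use; in particular, the
".0", ".1", ... suffix used to name tuple members is an assumption.

```shell
#!/bin/sh
# flatten_sysctl: reads "name: value ..." lines on stdin (sysctl-style
# output) and prints one "name value" pair per line. Multi-valued OIDs get
# a numeric suffix (name.0, name.1, ...); that naming scheme is hypothetical.
flatten_sysctl() {
    awk '
    {
        name = $1
        sub(/:$/, "", name)                      # drop the trailing colon
        n = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "{" || $i == "}") continue # skip vm.loadavg braces
            vals[n++] = $i
        }
        if (n == 1) {
            print name, vals[0]
        } else {
            for (j = 0; j < n; j++) print name "." j, vals[j]
        }
    }'
}
```

It would be used as a filter, e.g. `sysctl kern.openfiles kern.cp_time
vm.loadavg | flatten_sysctl`.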

Here's a list of the OIDs I'm currently using:

debug.dir_entry
debug.direct_blk_ptrs
debug.numcache
debug.numcachehv
debug.numneg
debug.to_avg_depth
debug.to_avg_gcalls
debug.to_avg_mpcalls
hw.usermem
kern.cp_time
kern.ipc.max_datalen
kern.ipc.max_hdr
kern.ipc.maxsockbuf
kern.ipc.msgmax
kern.ipc.msgmnb
kern.ipc.msgmni
kern.ipc.msgtql
kern.ipc.nmbclusters
kern.ipc.nmbjumbo16
kern.ipc.nmbjumbo9
kern.ipc.nmbjumbop
kern.ipc.nsfbufs
kern.ipc.nsfbufspeak
kern.ipc.nsfbufsused
kern.ipc.numopensockets
kern.ipc.pipekva
kern.ipc.pipes
kern.kstack_pages
kern.malloc_count
kern.maxfiles
kern.maxusers
kern.nselcoll
kern.openfiles
net.isr.count
net.isr.deferred
net.isr.directed
net.isr.drop
net.isr.queued
vfs.bufdefragcnt
vfs.buffreekvacnt
vfs.bufmallocspace
vfs.bufreusecnt
vfs.bufspace
vfs.cache.dotdothits
vfs.cache.dothits
vfs.cache.numcache
vfs.cache.numcalls
vfs.cache.numchecks
vfs.cache.numfullpathcalls
vfs.cache.numfullpathfail1
vfs.cache.numfullpathfail2
vfs.cache.numfullpathfail4
vfs.cache.numfullpathfound
vfs.cache.nummiss
vfs.cache.nummisszap
vfs.cache.numneg
vfs.cache.numneghits
vfs.cache.numnegzaps
vfs.cache.numposhits
vfs.cache.numposzaps
vfs.dirtybufferflushes
vfs.dirtybufthresh
vfs.flushwithdeps
vfs.freevnodes
vfs.getnewbufcalls
vfs.getnewbufrestarts
vfs.hibufspace
vfs.hidirtybuffers
vfs.hirunningspace
vfs.lobufspace
vfs.lodirtybuffers
vfs.lorunningspace
vfs.maxbufspace
vfs.maxmallocbufspace
vfs.nfs.downdelayinitial
vfs.nfs.downdelayinterval
vfs.nfs.realign_count
vfs.nfs.realign_test
vfs.nfs.reconnects
vfs.nfs4.access_cache_timeout
vfs.numdirtybuffers
vfs.numfreebuffers
vfs.numvnodes
vfs.read_max
vfs.reassignbufcalls
vfs.wantfreevnodes
vfs.write_behind
vm.loadavg
vm.stats.misc.cnt_prezero
vm.stats.misc.zero_page_count
vm.stats.sys.v_intr
vm.stats.sys.v_soft
vm.stats.sys.v_swtch
vm.stats.sys.v_syscall
vm.stats.sys.v_trap
vm.stats.vm.v_active_count
vm.stats.vm.v_cow_faults
vm.stats.vm.v_cow_optim
vm.stats.vm.v_forkpages
vm.stats.vm.v_forks
vm.stats.vm.v_free_count
vm.stats.vm.v_inactive_count
vm.stats.vm.v_intrans
vm.stats.vm.v_kthreads
vm.stats.vm.v_ozfod
vm.stats.vm.v_pdpages
vm.stats.vm.v_pdwakeups
vm.stats.vm.v_pfree
vm.stats.vm.v_reactivated
vm.stats.vm.v_rforks
vm.stats.vm.v_swapin
vm.stats.vm.v_swapout
vm.stats.vm.v_swappgsin
vm.stats.vm.v_swappgsout
vm.stats.vm.v_tfree
vm.stats.vm.v_vforkpages
vm.stats.vm.v_vforks
vm.stats.vm.v_vm_faults
vm.stats.vm.v_vnodein
vm.stats.vm.v_vnodeout
vm.stats.vm.v_vnodepgsin
vm.stats.vm.v_vnodepgsout
vm.stats.vm.v_wire_count
vm.stats.vm.v_zfod
vm.swap_idle_threshold1
vm.swap_idle_threshold2


I admit that I don't know what several of those actually mean: I figured
I'd capture what I can, then try to make sense of it.  It's very easy to
ignore data that I've captured, but don't need; it's a little harder to take
appropriate corrective action if I determine that there was some
information I should have captured, but didn't.  :-}

Still, if something's in there that's just silly, I wouldn't mind knowing
about it.  :-)

Thanks!

Peace,
david
--
David H. Wolfskill [hidden email]
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.


Re: Using sysctl(1) to gather resource consumption data

Brian Scott-30
David Wolfskill wrote:

> At $work, I've been trying to gather information on "interesting
> patterns" of resource consumption during moderately long-running (5 - 8
> hour) tasks; the hosts in question usually run FreeBSD 6.2, though
> there's an occasional 6.x that's more recent, as well as a bit of
> 7-STABLE.
>
> I wanted to have a low impact on the system being measured (of course),
> and I was unwilling to require that a system to be measured had any
> software installed on it other than base FreeBSD.  (Yes, that means I
> didn't assume Perl, though in practice in this environment, each does.)
>
> [...]

You may be interested in some software that I've written over the last 5
years or so called FreePDB. It's written in Perl and has a requirement
for an XML library to be installed. This sort of breaks your first
requirement but I'll describe it anyway.

I schedule a program to run regularly with cron. The program reads some
configuration data from an XML file telling it what needs to be
collected (and what mechanisms to use to collect it) and issues the
necessary commands (sysctl is definitely one of the possibilities) and
spits out rows into one or more text files.

In your case, I expect you would transfer the text files over to a
central system (the logger just creates a new file if someone steals the
old one), where another program loads the text files into database tables.

Graphing support includes the possibility to extract data into an rrd
file, as well as driving gnuplot or some Perl GD::Graph stuff, or even
hooking up Excel with ODBC from a Windows box and using the graph wizard.

Anyway, I just thought I'd mention it since it might save you some work.

It can be found at freepdb.sourceforge.net. It definitely runs on
FreeBSD, including 7.0 (I recently upgraded a 4.7 machine, but before
that it ran there quite nicely).

I'm just cleaning up a new release that includes choice of database
systems and a few performance/usability improvements. As they say in the
classics, "If you don't see what you need, just ask".

Regards,

Brian
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[hidden email]"
Re: Using sysctl(1) to gather resource consumption data

Norberto Meijome-6
In reply to this post by David Wolfskill
On Fri, 12 Sep 2008 16:48:22 -0700
David Wolfskill <[hidden email]> wrote:

> I wanted to have a low impact on the system being measured (of course),
> and I was unwilling to require that a system to be measured had any
> software installed on it other than base FreeBSD.  (Yes, that means I
> didn't assume Perl, though in practice in this environment, each does.)

Out of curiosity, how does bsnmpd compare to your approach with regard to
impact on the system? It is part of 7.0 (not sure about previous versions), and
it is definitely a more standard and cross-platform approach, with support on
the NOC / alerting side of things.

(For what it's worth, I've only used net-snmpd, not bsnmpd.)

B
_________________________
{Beto|Norberto|Numard} Meijome

"Whenever you find that you are on the side of the majority, it is time to
reform." Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
Re: Using sysctl(1) to gather resource consumption data

David Wolfskill
On Sun, Sep 14, 2008 at 09:11:36PM +1000, Norberto Meijome wrote:
> ...
> Out of curiosity, how does bsnmpd compare to your approach with regard to
> impact on the system? It is part of 7.0 (not sure about previous versions), and
> it is definitely a more standard and cross-platform approach, with support on
> the NOC / alerting side of things.
>
> (For what it's worth, I've only used net-snmpd, not bsnmpd.)

Understood.  As I understand it, an SNMP daemon (whether bsnmpd or
net-snmpd) would require some configuration on the remote host, and I
wasn't willing to require that.

Also, the only times I have used SNMP, it has been using a version that
did not support encryption in any form (as far as I know), and since
some of the transit was over facilities we don't control, I thought it
would be a bit more sensible to use SSH for the transport.

There is a moderate amount of work in setting up the SSH connection in
the first place: the first version of my script actually had the
"shepherd" script establish a new SSH connection to each remote host
every 5 minutes; examining a ktrace of that convinced me that SSH session
creation was not something I wanted to do on a frequent basis for a
mechanism that was intended to be low impact.

But keeping that SSH session around and "squirting" a little over 800
bytes of payload down the pipe every 5 minutes -- or even every 10
seconds -- shouldn't be too much impact.  (As a colleague pointed out,
that's probably less impact than running top(1) has.)

Granted, this isn't intended for the one "shepherd" script to deal with
thousands of remote hosts -- but I believe that "hundreds" is feasible.

Mind, I'm not especially keen on re-inventing stuff that already works
(or can be reasonably persuaded to work).  But in this case, running an
SNMP daemon seemed to fail to meet my (admittedly, somewhat self-
imposed) requirements.

Peace,
david
Re: Using sysctl(1) to gather resource consumption data

Valerio daelli-2
On Sun, Sep 14, 2008 at 2:07 PM, David Wolfskill <[hidden email]> wrote:
> On Sun, Sep 14, 2008 at 09:11:36PM +1000, Norberto Meijome wrote:
>> ...

Hi

I was thinking about extending net-snmp to gather some resource consumption
data (read-only MIBs).
I'll post a PR as soon as I have a working patch.

Valerio
Re: Using sysctl(1) to gather resource consumption data

Norberto Meijome-6
In reply to this post by David Wolfskill
On Sun, 14 Sep 2008 05:07:49 -0700
David Wolfskill <[hidden email]> wrote:

> On Sun, Sep 14, 2008 at 09:11:36PM +1000, Norberto Meijome wrote:
> > ...
> > Out of curiosity, how does bsnmpd compare to your approach with regard to
> > impact on the system? It is part of 7.0 (not sure about previous versions),
> > and it is definitely a more standard and cross-platform approach, with
> > support on the NOC / alerting side of things.
> >
> > (For what it's worth, I've only used net-snmpd, not bsnmpd.)
>
> Understood.  As I understand it, an SNMP daemon (whether bsnmpd or
> net-snmpd) would require some configuration on the remote host, and I
> wasn't willing to require that.

Fair enough. I don't know about the default config of bsnmpd, but "default" in
net-snmpd, IIRC, means you access it as public, which is pretty open. Not sure
if there are MIBs for the information you need, though.

> Also, the only times I have used SNMP, it has been using a version that
> did not support encryption in any form (as far as I know), and since
> some of the transit was over facilities we don't control, I thought it
> would be a bit more sensible to use SSH for the transport.

but do you use encryption with your current system?

[...]
> Mind, I'm not especially keen on re-inventing stuff that already works
> (or can be reasonably persuaded to work).  But in this case, running an
> SNMP daemon seemed to fail to meet my (admittedly, somewhat self-
> imposed) requirements.

Hey, your requirements are yours :) I was just curious to know why SNMP didn't
cut it.
B
_________________________
{Beto|Norberto|Numard} Meijome

"Gravity cannot be blamed for people falling in love."
  Albert Einstein

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
Re: Using sysctl(1) to gather resource consumption data

David Wolfskill
On Thu, Sep 18, 2008 at 12:39:27AM +1000, Norberto Meijome wrote:
> ...
> > Also, the only times I have used SNMP, it has been using a version that
> > did not support encryption in any form (as far as I know), and since
> > some of the transit was over facilities we don't control, I thought it
> > would be a bit more sensible to use SSH for the transport.
>
> but do you use encryption with your current system?

Since it uses SSH for transport, yes.  And it uses authentication, too
(for the same reason).

> [...]
> > Mind, I'm not especially keen on re-inventing stuff that already works
> > (or can be reasonably persuaded to work).  But in this case, running an
> > SNMP daemon seemed to fail to meet my (admittedly, somewhat self-
> > imposed) requirements.
>
> Hey, your requirements are yours :) I was just curious to know why SNMP didn't
> cut it.

Fair enough.  :-}

Peace,
david