6.1 busy server periodically hangs, waits, then recovers a couple minutes later - analysis?
I've done some searching, but the random, temporary lockup condition that
I'm experiencing doesn't seem to be reported very often. Anyway, here are my
symptoms; I was hoping someone could guide me towards some additional
testing or stats I can gather to help pinpoint the root cause.
The symptoms during this temporary frozen condition are as follows:
- Existing shells continue to be responsive
- Existing sessions, such as HTTP downloads, continue to work
- Programs already running within shells, such as vmstat/systat/iostat, etc.,
continue to spit out data
- *New* incoming socket requests, or new commands executed in a shell, sit
there and only return an established connection or execute the command
several minutes later, when the system returns to life.
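Since only *new* processes stall while running ones keep going, one way to put timestamps on the hangs would be a small probe left running in an already-open shell. The following is only a minimal sketch under my own assumptions: Python 3 is installed on the box, `true` is on the PATH, and the names `time_spawn` and `probe`, the one-second interval, and the two-second threshold are all made up for illustration.

```python
import subprocess
import sys
import time


def time_spawn(cmd=("true",)):
    """Return the seconds taken to start a new process and wait for it to exit."""
    start = time.monotonic()
    subprocess.run(cmd, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start


def probe(samples, interval=1.0, threshold=2.0, cmd=("true",)):
    """Spawn `cmd` `samples` times and collect the spawns that were slow.

    Returns a list of (wall_clock_timestamp, latency_seconds) tuples for
    every spawn that took at least `threshold` seconds.
    """
    slow = []
    for _ in range(samples):
        # Record the wall-clock time *before* spawning, so the entry marks
        # when the stall began rather than when it ended.
        wall = time.strftime("%Y-%m-%d %H:%M:%S")
        latency = time_spawn(cmd)
        if latency >= threshold:
            slow.append((wall, latency))
            print("%s spawn took %.1fs" % (wall, latency), file=sys.stderr)
        time.sleep(interval)
    return slow
```

Running something like `probe(3600)` during peak hours would give a list of stall start times that could then be lined up against the vmstat/iostat output from the same period.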
I tried messing with the default moderated polling settings for the em
driver, thinking the large number of interrupts coming from the NIC might
have something to do with it, but so far there has been no change.
This is scary: does GIANT-LOCKED mean that this storage subsystem driver
locks the entire kernel when it does I/O calls? (Sorry, I'm a little sketched
out reading about the random bits of FreeBSD 6 that don't yet use finer
/var/run/dmesg.boot:mpt0: <LSILogic SAS Adapter> port 0x3000-0x30ff mem
0xc8310000-0xc8313fff,0xc8300000-0xc830ffff irq 24 at device 1.0 on pci5
/var/run/dmesg.boot:mpt0: MPI Version=126.96.36.199
At first I thought the single CPU that takes the load from the single NIC
was draining to 0% idle, but I don't think that's really related, and it
wouldn't explain why new processes couldn't run in new sessions, etc.
As for dmesg errors, all we see are these errors that pop up every minute or
I have not used WITNESS before but would this be a good time to start
looking? Is the server simply too busy? What else could I look for or try
tweaking to get around this problem that doesn't happen at lower off-peak