4.8 "alternate system clock has died" error

Charles Sprickman
Hello all,

I've been digging through Google for more information on this.  I have a
4.8 box that's been up for about 430 days.  In the last week or so, top
and ps have started reporting all CPU usage numbers as zero, and running
"systat -vmstat" results in the message "The alternate system clock has
died! Reverting to ``pigs'' display".

I've found instances of this message in the archives from some 3.x users,
some pre-4.8 users and some 5.3 users.

There were a number of suggestions including a patch if pre-4.8, sending
init a HUP, and setting the following sysctl mib:
"kern.timecounter.method: 1".

I'm already at 4.8-p24, so I did not look into patching anything, and
HUP'ing init and setting the sysctl mib do not seem to have any effect.

I'm not quite ready to believe that some hardware has actually failed.
Perhaps due to the long uptime something has rolled over?
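
For a rough sanity check of the rollover idea, the arithmetic below computes
how long a tick counter of a given width takes to wrap.  It is only a sketch:
the 128 Hz rate is the RTC interrupt rate quoted later in this thread, the
100 Hz rate is an assumed system clock rate, and nothing in the thread
establishes that a counter of either width is actually involved.

```python
SECONDS_PER_DAY = 86400


def days_until_wrap(bits, ticks_per_second):
    """Days until an unsigned counter of `bits` bits wraps at the given tick rate."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_DAY


# Hypothetical counter widths and tick rates, chosen for illustration only.
for bits in (31, 32):
    for hz in (100, 128):
        days = days_until_wrap(bits, hz)
        print(f"{bits}-bit counter at {hz} Hz wraps after {days:.1f} days")
```

None of these combinations lands near 430 days, so a plain wrap at these
rates would not by itself explain the timing.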

Let me know what info you would like, I can supply anything.  I'm
following this with some mainboard info from dmidecode and a full dmesg.

Thanks,

Charles

dmidecode:

Handle 0x0002
         DMI type 2, 8 bytes.
         Base Board Information
                 Manufacturer: Tyan
                 Product Name: S2462 THUNDER K7
                 Version: EVT1
                 Serial Number:

dmesg:

Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
         The Regents of the University of California. All rights reserved.
FreeBSD 4.8-RELEASE-p24 #0: Sun Sep 19 08:44:43 GMT 2004
     root@:/usr/obj/usr/src/sys/XENA
Timecounter "i8254"  frequency 1193182 Hz
CPU: AMD Athlon(tm) MP 1600+ (1393.79-MHz 686-class CPU)
   Origin = "AuthenticAMD"  Id = 0x662  Stepping = 2
   Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
   AMD Features=0xc0480000<MP,AMIE,DSP,3DNow!>
real memory  = 1073676288 (1048512K bytes)
avail memory = 1041649664 (1017236K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
FreeBSD/SMP: Multiprocessor motherboard
  cpu0 (BSP): apic id:  1, version: 0x00040010, at 0xfee00000
  cpu1 (AP):  apic id:  0, version: 0x00040010, at 0xfee00000
  io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec00000
Preloaded elf kernel "kernel" at 0xc035e000.
Preloaded userconfig_script "/boot/kernel.conf" at 0xc035e09c.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 268435454 entries at 0xc00fdef0
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
IOAPIC #0 intpin 19 -> irq 2
IOAPIC #0 intpin 16 -> irq 5
IOAPIC #0 intpin 17 -> irq 9
IOAPIC #0 intpin 18 -> irq 10
pci0: <PCI bus> on pcib0
agp0: <AMD 762 host to AGP bridge> mem 0xfc000000-0xfc000fff,0xf8000000-0xfbffffff at device 0.0 on pci0
pcib1: <PCI to PCI bridge (vendor=1022 device=700d)> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
isab0: <PCI to ISA bridge (vendor=1022 device=7410)> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <AMD 766 ATA100 controller> port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
chip0: <PCI to Other bridge (vendor=1022 device=7413)> at device 7.3 on pci0
pci0: <OHCI USB controller> at 7.4 irq 2
asr0: <Adaptec Caching SCSI RAID> mem 0xf6000000-0xf7ffffff irq 5 at device 8.0 on pci0
asr0: major=154
asr0: ADAPTEC 2110S FW Rev. 380E, 1 channel, 256 CCBs, Protocol I2O
pcib2: <PCI to PCI bridge (vendor=1044 device=a500)> at device 8.1 on pci0
pci2: <PCI bus> on pcib2
ahc0: <Adaptec aic7899 Ultra160 SCSI adapter> port 0x1000-0x10ff mem 0xf4001000-0xf4001fff irq 5 at device 13.0 on pci0
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
ahc1: <Adaptec aic7899 Ultra160 SCSI adapter> port 0x1400-0x14ff mem 0xf4002000-0xf4002fff irq 9 at device 13.1 on pci0
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
pci0: <ATI Mach64-GR graphics accelerator> at 14.0
xl0: <3Com 3c980C Fast Etherlink XL> port 0x1c00-0x1c7f mem 0xf4004000-0xf400407f irq 10 at device 15.0 on pci0
xl0: Ethernet address: 00:e0:81:20:07:06
miibus0: <MII bus> on xl0
ukphy0: <Generic IEEE 802.3u media interface> on miibus0
ukphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl1: <3Com 3c980C Fast Etherlink XL> port 0x1c80-0x1cff mem 0xf4004400-0xf400447f irq 2 at device 16.0 on pci0
xl1: Ethernet address: 00:e0:81:20:07:07
miibus1: <MII bus> on xl1
ukphy1: <Generic IEEE 802.3u media interface> on miibus1
ukphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc87ff,0xc8800-0xc8fff,0xc9000-0xcefff,0xe0000-0xe3fff on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1: configured irq 3 not in bitmap of probed irqs 0
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: routing 8254 via IOAPIC #0 intpin 2
IPsec: Initialized Security Association Processing.
IP Filter: v3.4.31 initialized.  Default = pass all, Logging = enabled
SMP: AP CPU #1 Launched!
acd0: CDROM <CDU5211> at ata0-master PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at asr0 bus 0 target 3 lun 0
da0: <ADAPTEC RAID-1 380E> Fixed Direct Access SCSI-2 device
da0: Tagged Queueing Enabled
da0: 35003MB (71686144 512 byte sectors: 255H 63S/T 4462C)

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[hidden email]"

Re: 4.8 "alternate system clock has died" error

Uwe Doering
Charles Sprickman wrote:

> Hello all,
>
> I've been digging through Google for more information on this.  I have a
> 4.8 box that's been up for about 430 days.  In the last week or so, top
> and ps have started reporting all CPU usage numbers as zero, and running
> "systat -vmstat" results in the message "The alternate system clock has
> died! Reverting to ``pigs'' display".
>
> I've found instances of this message in the archives from some 3.x users,
> some pre-4.8 users and some 5.3 users.
>
> There were a number of suggestions including a patch if pre-4.8, sending
> init a HUP, and setting the following sysctl mib:
> "kern.timecounter.method: 1".
>
> I'm already at 4.8-p24, so I did not look into patching anything, and
> HUP'ing init and setting the sysctl mib do not seem to have any effect.
>
> I'm not quite ready to believe that some hardware has actually failed.
> Perhaps due to the long uptime something has rolled over?

We had this once at work, quite a while ago.  The "alternate system
clock" is in fact the Real Time Clock (RTC) on the mainboard.  In our
case we were lucky in that it was just the quartz device that failed due
to an improperly soldered lead which finally came off.  We fixed the
soldering and the problem was gone.

Now, there are of course plenty of other hardware reasons why the RTC
can fail, even temporarily like in your case.  Perhaps it is really time
for a new mainboard.

    Uwe
--
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
[hidden email]  |  http://www.escapebox.net

Re: 4.8 "alternate system clock has died" error

Charles Sprickman
On Fri, 18 Nov 2005, Uwe Doering wrote:

> Charles Sprickman wrote:
>> Hello all,
>>
>> I've been digging through Google for more information on this.  I have a
>> 4.8 box that's been up for about 430 days.  In the last week or so, top and
>> ps have started reporting all CPU usage numbers as zero, and running
>> "systat -vmstat" results in the message "The alternate system clock has
>> died! Reverting to ``pigs'' display".
>>
>> I've found instances of this message in the archives from some 3.x users,
>> some pre-4.8 users and some 5.3 users.
>>
>> There were a number of suggestions including a patch if pre-4.8, sending
>> init a HUP, and setting the following sysctl mib: "kern.timecounter.method:
>> 1".
>>
>> I'm already at 4.8-p24, so I did not look into patching anything, and
>> HUP'ing init and setting the sysctl mib do not seem to have any effect.
>>
>> I'm not quite ready to believe that some hardware has actually failed.
>> Perhaps due to the long uptime something has rolled over?
>
> We had this once at work, quite a while ago.  The "alternate system clock" is
> in fact the Real Time Clock (RTC) on the mainboard.  In our case we were
> lucky in that it was just the quartz device that failed due to an improperly
> soldered lead which finally came off.  We fixed the soldering and the problem
> was gone.

Are there any tools to verify that the RTC is working?  I don't exactly
understand what the RTC is, but would the machine not be suffering some
other problems if there was an actual hardware failure?  Doesn't the
system rely on this to time everything from the processors to memory to
PCI slots and interrupts?

Is there any simple way to figure out if this is hardware or software?

> Now, there are of course plenty of other hardware reasons why the RTC can
> fail, even temporarily like in your case.  Perhaps it is really time for a
> new mainboard.

Ouch, that would hurt.  This machine does not have much room for tinkering
(mail server).

Thanks,

Charles

>   Uwe
> --
> Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
> [hidden email]  |  http://www.escapebox.net
>

Re: 4.8 "alternate system clock has died" error

Uwe Doering
Charles Sprickman wrote:

> On Fri, 18 Nov 2005, Uwe Doering wrote:
>> Charles Sprickman wrote:
>>
>>> I've been digging through Google for more information on this.  I
>>> have a 4.8 box that's been up for about 430 days.  In the last week
>>> or so, top and ps have started reporting all CPU usage numbers as
>>> zero, and running "systat -vmstat" results in the message "The
>>> alternate system clock has died! Reverting to ``pigs'' display".
>>> [...]
>>
>> We had this once at work, quite a while ago.  The "alternate system
>> clock" is in fact the Real Time Clock (RTC) on the mainboard.  In our
>> case we were lucky in that it was just the quartz device that failed
>> due to an improperly soldered lead which finally came off.  We fixed
>> the soldering and the problem was gone.
>
> Are there any tools to verify that the RTC is working?

"systat -vmstat" will show you the interrupt that it drives.  In our
case it's irq8, which is in fact labeled "rtc".  It is supposed to run
at 128 Hz.  Under load it can drop to some lower value.  This is normal.
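
The same check can be made from the interrupt counters by taking two
samples and dividing the difference by the interval.  The following is a
minimal Python sketch, assuming that "vmstat -i" prints a line whose first
field is "rtc", followed by the irq label, the cumulative total and the
average rate.  The sample text and its numbers are made up for illustration
and are not from the machine discussed here.

```python
def rtc_total(vmstat_i_output):
    """Return the cumulative interrupt count from the 'rtc' line, or None."""
    for line in vmstat_i_output.splitlines():
        fields = line.split()
        # Assumed layout: name, irq label, total, rate
        if len(fields) >= 4 and fields[0] == "rtc":
            return int(fields[-2])
    return None


def rate_hz(total_before, total_after, seconds):
    """Interrupts per second between two samples taken `seconds` apart."""
    return (total_after - total_before) / seconds


# Hypothetical output captured 10 seconds apart.
SAMPLE_BEFORE = """interrupt      total      rate
clk irq0     1000000       100
rtc irq8     1280000       128
"""
SAMPLE_AFTER = """interrupt      total      rate
clk irq0     1001000       100
rtc irq8     1281280       128
"""

before = rtc_total(SAMPLE_BEFORE)
after = rtc_total(SAMPLE_AFTER)
print(rate_hz(before, after, 10.0))  # 128.0 for a healthy clock, 0.0 if it has stopped
```

Two samples are used because the rate column is an average since boot, so
it would stay close to 128 for a long time after the interrupt has stopped.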

> I don't exactly
> understand what the RTC is, but would the machine not be suffering some
> other problems if there was an actual hardware failure?  Doesn't the
> system rely on this to time everything from the processors to memory to
> PCI slots and interrupts?

No, the RTC drives only the interrupt that is responsible for collecting
the CPU usage data.  When it fails the CPU usage in "top", "ps" etc.
just drops to zero, as you've observed, but the server continues to run.
If the failure is permanent, the machine refuses to boot, though.  At
least that's what happened in our case.  Apparently the RTC chip is
essential to the mainboard's boot sequence.  For instance, the initial
date and time information comes from this chip.

On the other hand, if a reset corrects the problem then the RTC chip
probably got hung, or there is a problem with the interrupt controller
it is connected to.  On a properly working mainboard this shouldn't
happen, of course.

> Is there any simple way to figure out if this is hardware or software?

I don't know of any.  However, we have been running FreeBSD almost since
4.0, on various mainboards, UP and SMP, and we've never seen these symptoms
except in the one case mentioned above.  So I suppose it's not a kernel bug.
I haven't looked at the PR database, though.

    Uwe
--
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
[hidden email]  |  http://www.escapebox.net