watchdogd stat location


watchdogd stat location

mdtancsa
We sometimes run into an issue where our embedded devices, which boot off
a read-only SD card, hang with a controller error.  It's not clear whether
this is a BIOS/firmware/SD card issue or a driver bug.  It's pretty
infrequent, but annoying enough that I started to dig into why the box
is not rebooting via the hardware watchdog.  It seems to fail to reboot
the box because the stat that watchdogd does on the filesystem hits the
md-backed /etc, which is not impacted.  I know I could run an external
program, but would it be safer to change the default directory where the
stat is done to something that is generally not mounted on a ramdisk?

e.g.


 diff -u watchdogd.c.orig watchdogd.c
--- watchdogd.c.orig    2019-09-27 10:51:04.273113000 -0400
+++ watchdogd.c 2019-09-27 10:51:23.592200000 -0400
@@ -365,7 +365,7 @@
                if (test_cmd != NULL)
                        failed = system(test_cmd);
                else
-                       failed = stat("/etc", &sb);
+                       failed = stat("/boot", &sb);
 
                error = watchdog_getuptime(&ts_end);
                if (error) {


    ---Mike

_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-embedded
To unsubscribe, send any mail to "[hidden email]"

Re: watchdogd stat location

Warner Losh
On Fri, Sep 27, 2019 at 9:36 AM mike tancsa <[hidden email]> wrote:

>  diff -u watchdogd.c.orig watchdogd.c
> --- watchdogd.c.orig    2019-09-27 10:51:04.273113000 -0400
> +++ watchdogd.c 2019-09-27 10:51:23.592200000 -0400
> @@ -365,7 +365,7 @@
>                 if (test_cmd != NULL)
>                         failed = system(test_cmd);
>                 else
> -                       failed = stat("/etc", &sb);
> +                       failed = stat("/boot", &sb);
>
>                 error = watchdog_getuptime(&ts_end);
>                 if (error) {
>

I think this is good.

Warner

Re: watchdogd stat location

Aleksandr Rybalko-3
Fri, 27 Sep 2019 at 19:22, Warner Losh <[hidden email]> wrote:

> On Fri, Sep 27, 2019 at 9:36 AM mike tancsa <[hidden email]> wrote:
>
> >  diff -u watchdogd.c.orig watchdogd.c
> > --- watchdogd.c.orig    2019-09-27 10:51:04.273113000 -0400
> > +++ watchdogd.c 2019-09-27 10:51:23.592200000 -0400
> > @@ -365,7 +365,7 @@
> >                 if (test_cmd != NULL)
> >                         failed = system(test_cmd);
> >                 else
> > -                       failed = stat("/etc", &sb);
> > +                       failed = stat("/boot", &sb);
> >
> >                 error = watchdog_getuptime(&ts_end);
> >                 if (error) {
> >
>
> I think this is good.
>
> Warner

Why not just stat "/"?

I think embedded devices may have a monolithic kernel without any loadable
modules or boot config.

--
WBW
-------
Rybalko Aleksandr <[hidden email]>

Re: watchdogd stat location

mdtancsa
On 9/27/2019 3:19 PM, Oleksandr Rybalko wrote:

> Why not just stat "/"?
>
> I think embedded devices may have a monolithic kernel without any loadable
> modules or boot config.
>
I am all for that too.  Just something other than /etc or /var, which are
often mounted on a ramdisk.

    ---Mike


Re: watchdogd stat location

Warner Losh
On Fri, Sep 27, 2019, 1:21 PM mike tancsa <[hidden email]> wrote:

> I am all for that too. Just something other than /etc or /var which are
> often mounted on ramdisk.
>

I think that / is too special to ever cause disk I/O to happen. Other dirs
will sometimes not be in the cache. The notion here, perhaps bogus, is
that we want to check that the root FS is sane. The stat(2) is a cheap way
to do this that will eventually fail if / goes wonky enough. It's weak.

Warner


Re: watchdogd stat location

mdtancsa
On 9/27/2019 3:53 PM, Warner Losh wrote:

> I think that / is too special to cause disk IO to ever happen. Other
> dirs will sometimes not be in the cache.... The notion here, perhaps
> bogus, is that we want to check the root FS is sane. The stat(2) is a
> cheap way to do this that will eventually fail if / goes wonky enough.
> It's weak.
>
>
Would something like this buy any extra sanity, or is it not worth it?  I
guess fancier checks belong in a program passed to watchdogd.


# diff -u watchdogd.c.orig watchdogd.c
--- watchdogd.c.orig    2019-09-27 16:27:14.456973000 -0400
+++ watchdogd.c 2019-09-27 16:27:18.904885000 -0400
@@ -364,9 +364,23 @@
 
                if (test_cmd != NULL)
                        failed = system(test_cmd);
-               else
-                       failed = stat("/etc", &sb);
-
+               else {
+
+                       srand(time(NULL));
+                       switch(rand() % 4) {
+                               case 0:
+                                       failed = stat("/", &sb);
+                                       break;
+                               case 1:
+                                       failed = stat("/bin", &sb);
+                                       break;
+                               case 2:
+                                       failed = stat("/sbin", &sb);
+                                       break;
+                               default:
+                                       failed = stat("/usr", &sb);
+                       }
+               }
                error = watchdog_getuptime(&ts_end);
                if (error) {
                        end_program = 1;





Re: watchdogd stat location

Aleksandr Rybalko-3
/sbin and /usr may fail in many cases.
Maybe readdir, then stat a random entry?


--
WBW
-------
Rybalko Aleksandr <[hidden email]>

Re: watchdogd stat location

Warner Losh
In reply to this post by mdtancsa
On Fri, Sep 27, 2019 at 2:30 PM mike tancsa <[hidden email]> wrote:

> Would something like this buy any extra sanity ? or not worth it. I
> guess fancier checks belong in a passed program
>
>
> # diff -u watchdogd.c.orig watchdogd.c
> --- watchdogd.c.orig    2019-09-27 16:27:14.456973000 -0400
> +++ watchdogd.c 2019-09-27 16:27:18.904885000 -0400
> @@ -364,9 +364,23 @@
>
>                 if (test_cmd != NULL)
>                         failed = system(test_cmd);
> -               else
> -                       failed = stat("/etc", &sb);
> -
> +               else {
> +
> +                       srand(time(NULL));
> +                       switch(rand() % 4) {
> +                               case 0:
> +                                       failed = stat("/", &sb);
> +                                       break;
> +                               case 1:
> +                                       failed = stat("/bin", &sb);
> +                                       break;
> +                               case 2:
> +                                       failed = stat("/sbin", &sb);
> +                                       break;
> +                               default:
> +                                       failed = stat("/usr", &sb);
> +                       }
> +               }
>                 error = watchdog_getuptime(&ts_end);
>                 if (error) {
>                         end_program = 1;
>

I don't think the rand helps at all. I think you'd really rather do things
sequentially. And this introduces more assumptions about the underlying
filesystem(s).
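
For comparison, a sequential version of the check could be sketched as follows. This is an assumption-laden illustration, not proposed watchdogd code: stat_next_path() and its cursor argument are hypothetical names, and the path list is supplied by the caller so nothing is assumed about which directories exist.

```c
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of watchdogd): stat the path at *cursor,
 * then advance the cursor, wrapping at the end of the list.  Each call
 * checks exactly one path, so successive calls walk the list in order.
 * Returns the stat(2) result: 0 on success, -1 on failure.
 */
static int
stat_next_path(const char *const *paths, size_t npaths, size_t *cursor)
{
	struct stat sb;
	const char *path;

	path = paths[*cursor % npaths];
	*cursor = (*cursor + 1) % npaths;
	return (stat(path, &sb));
}
```

A caller would keep one cursor for the lifetime of the daemon and call this once per check interval, so every listed path is visited at a predictable rate instead of by chance.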

Warner

Re: watchdogd stat location

Ian Lepore-3
On Fri, 2019-09-27 at 15:31 -0600, Warner Losh wrote:

> I don't think the rand helps at all. I think you'd really rather do
> things sequentially. And this introduces more assumptions about the
> underlying filesystem(s).
>
> Warner
>

If we want to be sure to force physical I/O, how about
dd if=/ of=/dev/null count=1?

But I question the premise of forcing physical IO as being somehow a
better indicator of a non-hung system.  I think it's just a better
indicator of the sdcard problem that Mike is experiencing.  For anyone
else, forcing periodic physical IO is going to do annoying things like
spin up idle drives.

-- Ian


Re: watchdogd stat location

mdtancsa
On 9/28/2019 3:30 PM, Ian Lepore wrote:
> If we want to be sure to force physical IO, how about dd if=/
> of=/dev/null count=1 ?
>
> But I question the premise of forcing physical IO as being somehow a
> better indicator of a non-hung system.  I think it's just a better
> indicator of the sdcard problem that Mike is experiencing.  For anyone
> else, forcing periodic physical IO is going to do annoying things like
> spin up idle drives.


I think in my case, I am going to need to do that.  I was hoping doing a
simple stat on / or /boot would do the trick to recover from

mmcsd0: Error indicated: 1 Timeout
g_vfs_done():mmcsd0s1a[READ(offset=267358208, length=4096)]error = 5
vnode_pager_generic_getpages_done: I/O read error 5
vm_fault: pager read error, pid 1 (init)
sdhci_pci0-slot0: Got AutoCMD12 error 0x0001, but there is no active command.
sdhci_pci0-slot0: ============== REGISTER DUMP ==============
sdhci_pci0-slot0: Sys addr: 0x74ee0000 | Version:  0x00001001
sdhci_pci0-slot0: Blk size: 0x00005200 | Blk cnt:  0x00000008
sdhci_pci0-slot0: Argument: 0x0007f817 | Trn mode: 0x00000037
sdhci_pci0-slot0: Present:  0x01ff0000 | Host ctl: 0x00000007
sdhci_pci0-slot0: Power:    0x0000000f | Blk gap:  0x00000000
sdhci_pci0-slot0: Wake-up:  0x00000000 | Clock:    0x00000007
sdhci_pci0-slot0: Timeout:  0x0000000d | Int stat: 0x00000000
sdhci_pci0-slot0: Int enab: 0x01ff00fb | Sig enab: 0x01ff00fb
sdhci_pci0-slot0: AC12 err: 0x00000001 | Host ctl2:0x00000080
sdhci_pci0-slot0: Caps:     0x21fe32b2 | Caps2:    0x00000070
sdhci_pci0-slot0: Max curr: 0x00c80064 | ADMA err: 0x00000000
sdhci_pci0-slot0: ADMA addr:0x00000000 | Slot int: 0x000000ff
sdhci_pci0-slot0: ===========================================
g_vfs_done():mmcsd0s1a[READ(offset=267358208, length=4096)]error = 5
vnode_pager_generic_getpages_done: I/O read error 5

but it looks like no dice, at least in the one case I hit over the
weekend.  However, from the captured logs, I am not sure whether watchdogd
really got armed or not.

I think doing an actual raw read is the way to go, but putting that in
watchdogd feels like it would violate POLA.  Instead, I think I will
make it an external command, as that will meet my needs, or even roll my
own watchdogd, which might be even better for me.
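
A raw-read check of the kind described above might be sketched like this. Everything here is an assumption for illustration: probe_raw_read() is a hypothetical name, the block size is chosen by the caller, and the device node to probe (perhaps something like /dev/mmcsd0, going by the log lines) would have to be confirmed on the actual system.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical probe (not part of watchdogd): read one block from the
 * start of `dev`.  Returns 0 if the whole block was read and 1 otherwise,
 * so the value can be used directly as a process exit status.
 */
static int
probe_raw_read(const char *dev, size_t blksize)
{
	char *buf;
	ssize_t n;
	int fd, failed;

	buf = malloc(blksize);
	if (buf == NULL)
		return (1);
	fd = open(dev, O_RDONLY);
	if (fd == -1) {
		free(buf);
		return (1);
	}
	/* Read from offset 0; an error or a short read counts as failure. */
	n = pread(fd, buf, blksize, 0);
	failed = (n != (ssize_t)blksize);
	close(fd);
	free(buf);
	return (failed);
}
```

Wrapped in a small main() that returns probe_raw_read(argv[1], 512), this could be built as a standalone program and handed to watchdogd as the external check command, leaving the default behaviour of watchdogd untouched for everyone else.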

    ---Mike

--
-------------------
Mike Tancsa, tel +1 519 651 3400 x203
Sentex Communications, [hidden email]
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada  


Re: watchdogd stat location

John-Mark Gurney-2
In reply to this post by Ian Lepore-3
Ian Lepore wrote this message on Sat, Sep 28, 2019 at 13:30 -0600:

> If we want to be sure to force physical IO, how about dd if=/
> of=/dev/null count=1 ?
>
> But I question the premise of forcing physical IO as being somehow a
> better indicator of a non-hung system.  I think it's just a better
> indicator of the sdcard problem that Mike is experiencing.  For anyone
> else, forcing periodic physical IO is going to do annoying things like
> spin up idle drives.

Even better is to pull the device from df (zfs is a bit more complex,
but code in rc.d/growfs exists for it) and do I/O directly to the device.
Then you are bypassing the disk cache entirely.

But I agree that spinning up idle drives isn't a good option.  Maybe the
trivial file system check should be documented better: say that it stats
/etc, and mention that anyone who wants a different directory stat'd can
just use -e 'stat /dir' instead?
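
The reason -e 'stat /dir' can stand in for the built-in check is visible in the diff context earlier in the thread: the external command and the default stat are two branches of the same test. A simplified sketch, assuming a hypothetical run_check() wrapper (watchdogd itself assigns the raw return value to `failed` rather than normalizing it):

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>

/*
 * Hypothetical wrapper modelled on the diff context in this thread.
 * If a test command was supplied it replaces the default check entirely;
 * otherwise the default directory is stat'd.  Returns 0 when the check
 * passed and 1 when it failed (a simplification of the original code).
 */
static int
run_check(const char *test_cmd, const char *default_dir)
{
	struct stat sb;

	if (test_cmd != NULL)
		return (system(test_cmd) != 0);
	return (stat(default_dir, &sb) != 0);
}
```

Since a supplied command fully replaces the stat, pointing the check at another directory needs no source change at all.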

--
  John-Mark Gurney Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."