Deadlock involving truss -f, pdfork() and wait4()

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Deadlock involving truss -f, pdfork() and wait4()

Ryan Stone-2
I've hit an issue with a simple use of pdfork().  I have a process
that calls pdfork() and the parent immediately does a wait4() on the
child pid.  This works fine under normal conditions, but if the parent
is run under truss -f, the three processes deadlock.  If I switch out
pdfork() for fork(), the deadlock does not occur.

This C file demonstrates the issue:

https://people.freebsd.org/~rstone/pdfork.c

If I run "truss -f ./pdfork", which uses fork(), it completes within a
second.  If I run "truss -f ./pdfork -p", which uses pdfork(), the
processes deadlock.  If I run "./pdfork -p" without truss, it
completes normally.

procstat reports the following kernel stacks:

27572 102043 truss               -                   mi_switch+0xe2
sleepq_catch_signals+0x425 sleepq_wait_sig+0xf _sleep+0x1bf
kern_wait6+0x695 sys_wait6+0x9f amd64_syscall+0x36e
fast_syscall_common+0x101
27573 102469 pdfork              -                   mi_switch+0xe2
sleepq_catch_signals+0x425 sleepq_wait_sig+0xf _sleep+0x1bf
kern_wait6+0x695 sys_wait4+0x78 amd64_syscall+0x36e
fast_syscall_common+0x101
27574 102053 pdfork              -                   mi_switch+0xe2
thread_suspend_switch+0xd4 ptracestop+0x13b fork_return+0x14e
fork_exit+0x83 fork_trampoline+0xe

As near as I can tell, truss is blocked waiting for ptrace events, the
parent process is blocked in wait4, and the child process is perhaps
waiting for its parent to exit the kernel so it can send the ptrace
event?

I really don't see anything obvious in the pdfork() code path that
would cause this to happen when fork() doesn't have the problem.  It
may be that pdfork() just changes the timing enough to expose a latent
bug.

I'm seeing this on a recentish current (r351363).
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: Deadlock involving truss -f, pdfork() and wait4()

Mariusz Zaborski
Hello Ryan,

Can you verify is this patch fix your issue:
https://reviews.freebsd.org/D20362

Thanks,
Mariusz

On Thu, 12 Sep 2019 at 21:37, Ryan Stone <[hidden email]> wrote:

>
> I've hit an issue with a simple use of pdfork().  I have a process
> that calls pdfork() and the parent immediately does a wait4() on the
> child pid.  This works fine under normal conditions, but if the parent
> is run under truss -f, the three processes deadlock.  If I switch out
> pdfork() for fork(), the deadlock does not occur.
>
> This C file demonstrates the issue:
>
> https://people.freebsd.org/~rstone/pdfork.c
>
> If I run "truss -f ./pdfork", which uses fork(), it completes within a
> second.  If I run "truss -f ./pdfork -p", which uses pdfork(), the
> processes deadlock.  If I run "./pdfork -p" without truss, it
> completes normally.
>
> procstat reports the following kernel stacks:
>
> 27572 102043 truss               -                   mi_switch+0xe2
> sleepq_catch_signals+0x425 sleepq_wait_sig+0xf _sleep+0x1bf
> kern_wait6+0x695 sys_wait6+0x9f amd64_syscall+0x36e
> fast_syscall_common+0x101
> 27573 102469 pdfork              -                   mi_switch+0xe2
> sleepq_catch_signals+0x425 sleepq_wait_sig+0xf _sleep+0x1bf
> kern_wait6+0x695 sys_wait4+0x78 amd64_syscall+0x36e
> fast_syscall_common+0x101
> 27574 102053 pdfork              -                   mi_switch+0xe2
> thread_suspend_switch+0xd4 ptracestop+0x13b fork_return+0x14e
> fork_exit+0x83 fork_trampoline+0xe
>
> As near as I can tell, truss is blocked waiting for ptrace events, the
> parent process is blocked in wait4, and the child process is perhaps
> waiting for its parent to exit the kernel so it can send the ptrace
> event?
>
> I really don't see anything obvious in the pdfork() code path that
> would cause this to happen when fork() doesn't have the problem.  It
> may be that pdfork() just changes the timing enough to expose a latent
> bug.
>
> I'm seeing this on a recentish current (r351363).
> _______________________________________________
> [hidden email] mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "[hidden email]"
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: Deadlock involving truss -f, pdfork() and wait4()

Ryan Stone-2
This gets me a little further but now the wait4 call by the parent
never reaps the child and instead blocks forever:

# truss -f ./pdfork -p
 706: mmap(0x0,131072,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34361
970688 (0x800221000)
 706: issetugid()                               = 0 (0x0)
 706: open("/etc/libmap.conf",O_RDONLY|O_CLOEXEC,010440020) = 3 (0x3)
 706: fstat(3,{ mode=-rw-r--r-- ,inode=241058,size=47,blksize=32768 }) = 0 (0x0
)
 706: read(3,"# $FreeBSD$\nincludedir /usr/loc"...,47) = 47 (0x2f)
 706: close(3)                                  = 0 (0x0)
 706: open("/usr/local/etc/libmap.d",O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC,
0165) ERR#2 'No such file or directory'
 706: open("/var/run/ld-elf.so.hints",O_RDONLY|O_CLOEXEC,010002355) = 3 (0x3)
 706: read(3,"Ehnt\^A\0\0\0\M^@\0\0\0A\0\0\0\0"...,128) = 128 (0x80)
 706: fstat(3,{ mode=-r--r--r-- ,inode=72,size=193,blksize=4096 }) = 0 (0x0)
 706: pread(3,"/lib:/usr/lib:/usr/lib/compat:/u"...,65,0x80) = 65 (0x41)
 706: close(3)                                  = 0 (0x0)
 706: open("/lib/libc.so.7",O_RDONLY|O_CLOEXEC|O_VERIFY,057400) = 3 (0x3)
 706: fstat(3,{ mode=-r--r--r-- ,inode=402754,size=1915744,blksize=32768 }) = 0
(0x0)
 706: mmap(0x0,4096,PROT_READ,MAP_PRIVATE|MAP_PREFAULT_READ,3,0x0) = 3436210176
0 (0x800241000)
 706: mmap(0x0,4169728,PROT_NONE,MAP_GUARD,-1,0x0) = 34362105856 (0x800242000)
 706: mmap(0x800242000,524288,PROT_READ,MAP_PRIVATE|MAP_FIXED|MAP_NOCORE|MAP_PR
EFAULT_READ,3,0x0) = 34362105856 (0x800242000)
 706: mmap(0x8002c2000,1327104,PROT_READ|PROT_EXEC,MAP_PRIVATE|MAP_FIXED|MAP_NO
CORE|MAP_PREFAULT_READ,3,0x80000) = 34362630144 (0x8002c2000)
 706: mmap(0x800406000,61440,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_FIXED|MAP_PRE
FAULT_READ,3,0x1c4000) = 34363957248 (0x800406000)
 706: mmap(0x800415000,2256896,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_FIXED|MAP_A
NON,-1,0x0) = 34364018688 (0x800415000)
 706: munmap(0x800241000,4096)                  = 0 (0x0)
 706: close(3)                                  = 0 (0x0)
 706: mprotect(0x80040c000,36864,PROT_READ)     = 0 (0x0)
 706: sysarch(AMD64_SET_FSBASE,0x7fffffffe110)  = 0 (0x0)
 706: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 706: mprotect(0x80040c000,36864,PROT_READ|PROT_WRITE) = 0 (0x0)
 706: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
 706: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 706: mprotect(0x80040c000,36864,PROT_READ)     = 0 (0x0)
 706: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
 706: readlink("/etc/malloc.conf",0x7fffffffd830,1024) ERR#2 'No such file or d
irectory'
 706: issetugid()                               = 0 (0x0)
 706: __sysctl(0x7fffffffd7e0,0x2,0x7fffffffd7dc,0x7fffffffd7d0,0x0,0x0) = 0 (0
x0)
 706: mmap(0x0,2097152,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(21
),-1,0x0) = 34368126976 (0x800800000)
 706: cap_getmode({ 0 })                        = 0 (0x0)
 706: open("/dev/hpet0",O_RDONLY,00)            = 3 (0x3)
 706: mmap(0x0,4096,PROT_READ,MAP_SHARED,3,0x0) = 34362101760 (0x800241000)
 706: close(3)                                  = 0 (0x0)
 706: mmap(0x0,32768,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(12),
-1,0x0) = 34366275584 (0x80063c000)
 706: mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(21
),-1,0x0) = 34370224128 (0x800a00000)
 706: mmap(0x0,4096,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(12),-
1,0x0) = 34366308352 (0x800644000)
 706: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 706: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
 706: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 706: mprotect(0x203000,4096,PROT_READ)         = 0 (0x0)
 706: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
 706: pdfork(0x7fffffffeba8,0x0)                = 708 (0x2c4)
 708: <new process>
 708: nanosleep({ 1.000000000 })                = 0 (0x0)
 708: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 708: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
 708: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 708: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
 708: sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIG
TERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFS
Z|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
 708: sigprocmask(SIG_SETMASK,{ },0x0)          = 0 (0x0)
load: 0.27  cmd: pdfork 706 [wait] 18.20r 0.00u 0.00s 0% 2072k

# ps
PID TT  STAT    TIME COMMAND
698 u0  Is   0:00.01 login [pam] (login)
700 u0  I    0:00.04 -sh (sh)
705 u0  I+   0:00.10 truss -f ./pdfork -p
706 u0  IX+  0:00.01 ./pdfork -p
708 u0  Z+   0:00.00 <defunct>
714  0  S    0:00.01 su
715  0  S    0:00.01 su (sh)
716  0  R+   0:00.00 ps
# procstat -kk 708
 PID    TID COMM                TDNAME              KSTACK
# procstat -kk 706
 PID    TID COMM                TDNAME              KSTACK
 706 100095 pdfork              -                   mi_switch+0x174
sleepq_switch+0x110 sleepq_catch_signals+0x417 slee
pq_wait_sig+0xf _sleep+0x2d0 kern_wait6+0x48f sys_wait4+0x78
amd64_syscall+0x337 fast_syscall_common+0x101
# procstat -kk 705
 PID    TID COMM                TDNAME              KSTACK
 705 100077 truss               -                   mi_switch+0x174
sleepq_switch+0x110 sleepq_catch_signals+0x417 slee
pq_wait_sig+0xf _sleep+0x2d0 kern_wait6+0x48f sys_wait6+0x9f
amd64_syscall+0x337 fast_syscall_common+0x101

On Fri, Sep 13, 2019 at 10:05 AM Mariusz Zaborski <[hidden email]> wrote:

>
> Hello Ryan,
>
> Can you verify is this patch fix your issue:
> https://reviews.freebsd.org/D20362
>
> Thanks,
> Mariusz
>
> On Thu, 12 Sep 2019 at 21:37, Ryan Stone <[hidden email]> wrote:
> >
> > I've hit an issue with a simple use of pdfork().  I have a process
> > that calls pdfork() and the parent immediately does a wait4() on the
> > child pid.  This works fine under normal conditions, but if the parent
> > is run under truss -f, the three processes deadlock.  If I switch out
> > pdfork() for fork(), the deadlock does not occur.
> >
> > This C file demonstrates the issue:
> >
> > https://people.freebsd.org/~rstone/pdfork.c
> >
> > If I run "truss -f ./pdfork", which uses fork(), it completes within a
> > second.  If I run "truss -f ./pdfork -p", which uses pdfork(), the
> > processes deadlock.  If I run "./pdfork -p" without truss, it
> > completes normally.
> >
> > procstat reports the following kernel stacks:
> >
> > 27572 102043 truss               -                   mi_switch+0xe2
> > sleepq_catch_signals+0x425 sleepq_wait_sig+0xf _sleep+0x1bf
> > kern_wait6+0x695 sys_wait6+0x9f amd64_syscall+0x36e
> > fast_syscall_common+0x101
> > 27573 102469 pdfork              -                   mi_switch+0xe2
> > sleepq_catch_signals+0x425 sleepq_wait_sig+0xf _sleep+0x1bf
> > kern_wait6+0x695 sys_wait4+0x78 amd64_syscall+0x36e
> > fast_syscall_common+0x101
> > 27574 102053 pdfork              -                   mi_switch+0xe2
> > thread_suspend_switch+0xd4 ptracestop+0x13b fork_return+0x14e
> > fork_exit+0x83 fork_trampoline+0xe
> >
> > As near as I can tell, truss is blocked waiting for ptrace events, the
> > parent process is blocked in wait4, and the child process is perhaps
> > waiting for its parent to exit the kernel so it can send the ptrace
> > event?
> >
> > I really don't see anything obvious in the pdfork() code path that
> > would cause this to happen when fork() doesn't have the problem.  It
> > may be that pdfork() just changes the timing enough to expose a latent
> > bug.
> >
> > I'm seeing this on a recentish current (r351363).
> > _______________________________________________
> > [hidden email] mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "[hidden email]"
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: Deadlock involving truss -f, pdfork() and wait4()

Mark Johnston-2
On Fri, Sep 13, 2019 at 02:12:56PM -0400, Ryan Stone wrote:
> This gets me a little further but now the wait4 call by the parent
> never reaps the child and instead blocks forever:

Does it perform a wildcarded wait(), or does it explicitly specify the
PID of the child?  By design, the former will not return children
created by pdfork().
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: Deadlock involving truss -f, pdfork() and wait4()

Conrad Meyer-2
On Fri, Sep 13, 2019 at 11:37 AM Mark Johnston <[hidden email]> wrote:
>
> On Fri, Sep 13, 2019 at 02:12:56PM -0400, Ryan Stone wrote:
> > This gets me a little further but now the wait4 call by the parent
> > never reaps the child and instead blocks forever:
>
> Does it perform a wildcarded wait(), or does it explicitly specify the
> PID of the child?  By design, the former will not return children
> created by pdfork().

Explicit PID:

https://people.freebsd.org/~rstone/pdfork.c

Best,
Conrad
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: Deadlock involving truss -f, pdfork() and wait4()

Ryan Stone-2
In reply to this post by Mark Johnston-2
As Conrad has pointed out, it's an explicit PID.  The test completes
successfully when not run under truss -f.

On Fri, Sep 13, 2019 at 2:37 PM Mark Johnston <[hidden email]> wrote:
>
> On Fri, Sep 13, 2019 at 02:12:56PM -0400, Ryan Stone wrote:
> > This gets me a little further but now the wait4 call by the parent
> > never reaps the child and instead blocks forever:
>
> Does it perform a wildcarded wait(), or does it explicitly specify the
> PID of the child?  By design, the former will not return children
> created by pdfork().
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[hidden email]"