hwpmc: NMI cpuxx ... going to debugger

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

hwpmc: NMI cpuxx ... going to debugger

shreyank amartya
Hi,

I'm running a profiler which utilizes hwpmc and am frequently running into
this situation: after a process context switch out I get an NMI. The NMI
handler is unable to service it due to the fact that the pmc hardware is
marked as free at the end of the process_csw_out() (hwpmc_mod.c) routine.

Here are some logs I've collected when this happens:

 1750 231   49360676470980 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00
  1749 231   49360676470360 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1748 249   49360676421260 CSW:SWO:3: cpu=249 ri=33
  1748 249   49360676421260 CSW:SWO:3: cpu=249 ri=33
  1747 249   49360676420320 MDP:SWO:1: pc=0xfffff8010a06c400
pp=0xfffff81077179000 enable-msr=0
  1746 249   49360676419000 CSW:SWO:1: cpu=249 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1745 231   49360675789440 MDP:INT:2: retval=0 isnull=16 ovrflw=0
  1744 231   49360675789240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
  1743 231   49360675788260 CSW:SWO:3: cpu=231 ri=33
  1742 231   49360675788080 MDP:SWO:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1741 231   49360675787880 MDP:CFG:1: cpu=231 ri=0 pm=0x0
  1740 231   49360675787020 CSW:SWO:2: cpu=231 ri=17 val=-164 (samp)
  1739 231   49360675786940 MDP:REA:2: amd-read (post-munge) id=0 -> -164
  1738 231   49360675786880 MDP:REA:2: amd-read (pre-munge) id=0 -> 164
  1737 231   49360675786720 MDP:REA:1: amd-read id=0 class=2
  1736 231   49360675785180 MDP:STO:1: amd-stop ri=0
  1735 231   49360675784160 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1734 231   49360675617080 MDP:INT:2: retval=1 isnull=0 ovrflw=0
  1733 231   49360675615240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
  1732 231   49360675505160 MDP:SWI:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1731 231   49360675503600 MDP:STA:2: amd-start config=0x530076
  1730 231   49360675503060 MDP:STA:1: amd-start cpu=231 ri=0
  1729 231   49360675500720 MDP:WRI:1: amd-write cpu=231 ri=0
v=fffffffffffd97d4
  1728 231   49360675500240 CSW:SWI:1: cpu=231 ri=17 new=157740
  1727 231   49360675498680 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00
  1726 231   49360675498060 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1725 154   49360675363180 CSW:SWO:3: cpu=154 ri=33
  1724 154   49360675362020 MDP:SWO:1: pc=0xfffff80106d59600
pp=0xfffff81077179000 enable-msr=0
  1723 154   49360675360220 CSW:SWO:1: cpu=154 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1722 231   49360675009200 CSW:SWO:3: cpu=231 ri=33
  1721 231   49360675009020 MDP:SWO:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1720 231   49360675008320 MDP:CFG:1: cpu=231 ri=0 pm=0x0
  1719 231   49360675006220 CSW:SWO:2: cpu=231 ri=17 val=157740 (samp)
  1718 231   49360675005700 MDP:REA:2: amd-read (post-munge) id=0 -> 157740
  1717 231   49360675005240 MDP:REA:2: amd-read (pre-munge) id=0 ->
281474976552916
  1716 231   49360675004140 MDP:REA:1: amd-read id=0 class=2
  1715 231   49360675002660 MDP:STO:1: amd-stop ri=0
  1714 231   49360675001460 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1713 231   49360674941580 MDP:INT:2: retval=1 isnull=0 ovrflw=0
  1712 231   49360674939500 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
  1711 231   49360674793500 MDP:SWI:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1710 231   49360674792680 MDP:STA:2: amd-start config=0x530076
  1709 220   49360674792340 CSW:SWO:3: cpu=220 ri=33
  1708 220   49360674791060 MDP:SWO:1: pc=0xfffff80106f8a600
pp=0xfffff81077179000 enable-msr=0
  1707 231   49360674790840 MDP:STA:1: amd-start cpu=231 ri=0

The stray NMI is entry 1744 & 1745 (isnull=16). In entry 1741 when the
context switch out occurs, the hardware is marked free (pm=0x0). The NMI
handler is unable to find the hardware and ends up returning 0. Entry 1740
reports the val=-164, I'm not sure about the significance but the value
always seems to be negative when this problem occurs.

Once i get a stray NMI, I do not get any further interrupts which is
results in loss of samples in the generated pmclog.

I'm using hwpmc from the most recent stable/12 branch which is same as
current (with a stable/12 kernel). I'm running it on a java application
with a single event in sampling mode.

Any idea on how to resolve/avoid this problem?
Any pointers or suggestion are appreciated.

Thanks
Shreyank
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-drivers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: hwpmc: NMI cpuxx ... going to debugger

shreyank amartya
Hi,

I have narrowed down the problem to this:

- Inside process_csw_out() (during thread context switch out) routine in
hwpmc_mod.c, we stop the pmc and record the current value for reloading it
at a future time. here, I observe that the pmc has already overflowed while
recording the value. The NMI handler is invoked near the end of the routine
and it fails to service the interrupt.

Now, is this expected behavior? Should I modify the code to deal with this
situation? Or should this situation not occur at all, in which case this
could be a hardware problem.
If you could shed some light on this, I will be able to pursue this further.

Thanks for your time
Shreyank

On Wed, Sep 4, 2019 at 12:09 PM Andriy Gapon <[hidden email]> wrote:

> On 03/09/2019 21:03, shreyank amartya wrote:
> > Hi,
> >
> > I'm running a profiler which utilizes hwpmc and am frequently running
> into this
> > situation: after a process context switch out I get an NMI. The NMI
> handler is
> > unable to service it due to the fact that the pmc hardware is marked as
> free at
> > the end of the process_csw_out() (hwpmc_mod.c) routine.
> >
> > Here are some logs I've collected when this happens:
> >
> >  1750 231   49360676470980 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00
> >   1749 231   49360676470360 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000
> (5930,
> > java) pp=0xfffff81077179000
> >   1748 249   49360676421260 CSW:SWO:3: cpu=249 ri=33
> >   1748 249   49360676421260 CSW:SWO:3: cpu=249 ri=33
> >   1747 249   49360676420320 MDP:SWO:1: pc=0xfffff8010a06c400
> > pp=0xfffff81077179000 enable-msr=0
> >   1746 249   49360676419000 CSW:SWO:1: cpu=249 proc=0xfffff813e80ff000
> (5930,
> > java) pp=0xfffff81077179000
> >   1745 231   49360675789440 MDP:INT:2: retval=0 isnull=16 ovrflw=0
> >   1744 231   49360675789240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
> >   1743 231   49360675788260 CSW:SWO:3: cpu=231 ri=33
> >   1742 231   49360675788080 MDP:SWO:1: pc=0xfffff80106fe1400
> > pp=0xfffff81077179000 enable-msr=0
> >   1741 231   49360675787880 MDP:CFG:1: cpu=231 ri=0 pm=0x0
> >   1740 231   49360675787020 CSW:SWO:2: cpu=231 ri=17 val=-164 (samp)
> >   1739 231   49360675786940 MDP:REA:2: amd-read (post-munge) id=0 -> -164
> >   1738 231   49360675786880 MDP:REA:2: amd-read (pre-munge) id=0 -> 164
> >   1737 231   49360675786720 MDP:REA:1: amd-read id=0 class=2
> >   1736 231   49360675785180 MDP:STO:1: amd-stop ri=0
> >   1735 231   49360675784160 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000
> (5930,
> > java) pp=0xfffff81077179000
> >   1734 231   49360675617080 MDP:INT:2: retval=1 isnull=0 ovrflw=0
> >   1733 231   49360675615240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
> >   1732 231   49360675505160 MDP:SWI:1: pc=0xfffff80106fe1400
> > pp=0xfffff81077179000 enable-msr=0
> >   1731 231   49360675503600 MDP:STA:2: amd-start config=0x530076
> >   1730 231   49360675503060 MDP:STA:1: amd-start cpu=231 ri=0
> >   1729 231   49360675500720 MDP:WRI:1: amd-write cpu=231 ri=0
> v=fffffffffffd97d4
> >   1728 231   49360675500240 CSW:SWI:1: cpu=231 ri=17 new=157740
> >   1727 231   49360675498680 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00
> >   1726 231   49360675498060 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000
> (5930,
> > java) pp=0xfffff81077179000
> >   1725 154   49360675363180 CSW:SWO:3: cpu=154 ri=33
> >   1724 154   49360675362020 MDP:SWO:1: pc=0xfffff80106d59600
> > pp=0xfffff81077179000 enable-msr=0
> >   1723 154   49360675360220 CSW:SWO:1: cpu=154 proc=0xfffff813e80ff000
> (5930,
> > java) pp=0xfffff81077179000
> >   1722 231   49360675009200 CSW:SWO:3: cpu=231 ri=33
> >   1721 231   49360675009020 MDP:SWO:1: pc=0xfffff80106fe1400
> > pp=0xfffff81077179000 enable-msr=0
> >   1720 231   49360675008320 MDP:CFG:1: cpu=231 ri=0 pm=0x0
> >   1719 231   49360675006220 CSW:SWO:2: cpu=231 ri=17 val=157740 (samp)
> >   1718 231   49360675005700 MDP:REA:2: amd-read (post-munge) id=0 ->
> 157740
> >   1717 231   49360675005240 MDP:REA:2: amd-read (pre-munge) id=0 ->
> 281474976552916
> >   1716 231   49360675004140 MDP:REA:1: amd-read id=0 class=2
> >   1715 231   49360675002660 MDP:STO:1: amd-stop ri=0
> >   1714 231   49360675001460 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000
> (5930,
> > java) pp=0xfffff81077179000
> >   1713 231   49360674941580 MDP:INT:2: retval=1 isnull=0 ovrflw=0
> >   1712 231   49360674939500 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
> >   1711 231   49360674793500 MDP:SWI:1: pc=0xfffff80106fe1400
> > pp=0xfffff81077179000 enable-msr=0
> >   1710 231   49360674792680 MDP:STA:2: amd-start config=0x530076
> >   1709 220   49360674792340 CSW:SWO:3: cpu=220 ri=33
> >   1708 220   49360674791060 MDP:SWO:1: pc=0xfffff80106f8a600
> > pp=0xfffff81077179000 enable-msr=0
> >   1707 231   49360674790840 MDP:STA:1: amd-start cpu=231 ri=0
> >
> > The stray NMI is entry 1744 & 1745 (isnull=16). In entry 1741 when the
> context
> > switch out occurs, the hardware is marked free (pm=0x0). The NMI handler
> is
> > unable to find the hardware and ends up returning 0. Entry 1740 reports
> the
> > val=-164, I'm not sure about the significance but the value always seems
> to be
> > negative when this problem occurs.
> >
> > Once i get a stray NMI, I do not get any further interrupts which is
> results in
> > loss of samples in the generated pmclog.
> >
> > I'm using hwpmc from the most recent stable/12 branch which is same as
> current
> > (with a stable/12 kernel). I'm running it on a java application with a
> single
> > event in sampling mode.
> >
> > Any idea on how to resolve/avoid this problem?
> > Any pointers or suggestion are appreciated.
>
> Sorry, I do not have time to debug this.
> It looks like there could be a discrepancy between the hardware state and
> the
> driver state, so the hardware thinks that it should generate the NMI, but
> the
> driver does not expect to get it.  Or there could be a hardware problem,
> errata
> are not unheard of.
> It seems that you already have some data on what's going on, so maybe you
> will
> be able to pursue this further and narrow down the possibilities.
>
> Sorry again for not being able to actually help.
>
>
> --
> Andriy Gapon
>
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-drivers
To unsubscribe, send any mail to "[hidden email]"