ufs snapshot is sometimes corrupt on gjournaled partition


ufs snapshot is sometimes corrupt on gjournaled partition

Andreas Longwitz
I am trying to understand the cause of the "free inode" problem described in
  https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html.

I have set up a test server (FreeBSD 10.3-STABLE #4 r317936) with a
gjournaled partition for /home:
mount -t ufs | grep /home -->
/dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
noatime, gjournal)

My test creates one snapshot of /home (it always gets inode 4) and then
removes this snapshot:

  for i in 1 2 3 4 5 6 7 8; do
     echo "starting snaptest $i" >/dev/console
     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
     echo $(ls -ils /home/.snap/fscktest) >/dev/console
     rm -f /home/.snap/fscktest /home
  done

I never have more than this one snapshot in use, and during the test no
other user processes are working on /home. A typical output looks like this:

Jun 21 15:59:52 root: starting snaptest 1
Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:53 root: starting snaptest 2
Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
Jun 21 15:59:54 root: starting snaptest 3
Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
Jun 21 15:59:56 root: starting snaptest 4
Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:57 root: starting snaptest 5
Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:58 root: starting snaptest 6
Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
21 Jun 15:59 /home/.snap/fscktest
Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
Jun 21 15:59:59 root: starting snaptest 7
Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 16:00 /home/.snap/fscktest
Jun 21 16:00:00 root: starting snaptest 8
Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
21 Jun 16:00 /home/.snap/fscktest

The "free inode /home/4 had NNN blocks" message during the run of the mount
command is output from ffs_valloc(), because ffs_load_inode() has loaded the
disk inode 4 with a non-zero i_blocks field. The corresponding "freeing
inode /home/4 with NNN blocks" message during the preceding rm command
is output from the following diagnostic patch of mine in ffs_truncate():

--- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
+++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
@@ -551,6 +551,9 @@
                DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
        else    /* sanity */
                DIP_SET(ip, i_blocks, 0);
+       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
+               printf("freeing inode %s/%lu with %ld blocks\n",
+                       fs->fs_fsmnt, (u_long)ip->i_number, (long)DIP(ip, i_blocks));
        ip->i_flag |= IN_CHANGE;
 #ifdef QUOTA
        (void) chkdq(ip, -blocksreleased, NOCRED, 0);

The rm command can only free all the blocks of the snapshot file (meaning
i_blocks for inode 4 ends up at zero) if the file has the "correct" size:

ls -ils /home/.snap/fscktest -->
4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
/home/.snap/fscktest

The size of the /home partition is given by
diskinfo /dev/mirror/gmsvt7p10.journal -->
/dev/mirror/gmsvt7p10.journal   512     90762954240     177271395

So we have 2769865 full 32 kB blocks totaling 90762936320 bytes. While
creating a snapshot, a "last block" (32 kB) is written at this offset,
ending at 90762969088. Finally the snapblklist is written with
VOP_WRITE: "Write out the list of allocated blocks to the end of the
snapshot". In all my correct tests the snapblklist table is 1152 bytes
long, giving the correct snapshot file size of 90762970240. In this
case the snapblklist table has 144 entries of 8 bytes each: one length
entry and 143 logical block numbers recorded in mapacct_ufs2():

          if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
                   *ip->i_snapblklist++ = lblkno;

The console output above shows three error situations with block
counters 704, 2112 and 192. Dividing these values by 8 gives exactly the
number of bytes by which the snapblklist at the end of the snapshot file
is short, so in these cases the snapshot file is corrupt.

I use a test kernel with some extra counters ct_* in mapacct_ufs2():

           ++ct_blkno_all;
           if (blkno == 0)
                   ++ct_blkno_0;
           if (blkno == BLK_NOCOPY)
                   ++ct_blkno_nocopy;
           if (blkno == BLK_SNAP)
                   ++ct_blkno_snap;
           if (blkno == 0 || blkno == BLK_NOCOPY)
                   continue;
           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
                   *ip->i_snapblklist++ = lblkno;
                   ++ct_snapblklist;
           }
           if (blkno == BLK_SNAP)
                   blkno = blkstofrags(fs, lblkno);
           ++ct_blkfree;
           ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
               vp->v_type, NULL);

and for the 8 test runs shown above I can see these results using DTrace
at probe expunge_ufs2:return (blkno_snap is always 0):

 test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
 -------------------------------------------------------------------
 1 ok   2770545   353320  2416404      143         821     2416404
 2 bad  2770545   587860  2181875      132         810     2416404
 3 bad  2770545   956582  1813175      110         788     2416393
 4 ok   2770545   353364  2416360      143         821     2416360
 5 ok   2770545   353364  2416360      143         821     2416360
 6 bad  2770545   418376  2351351      140         818     2416360
 7 ok   2770545   353367  2416357      143         821     2416357
 8 ok   2770545   353367  2416357      143         821     2416357

For correct tests the sum of blkno_0 and blkno_nocopy is always the same
(2769724); for bad tests the blkno_nocopy counter in particular is
significantly lower. In the test table I give one more column, cg_nocopy,
for a counter I added in cgaccount() to see how many entries are
set to BLK_NOCOPY during the copy of the cylinder group maps:

                if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
                        ++ct_cg_nocopy;
                        DIP_SET(ip, i_db[loc], BLK_NOCOPY);
                }
                ...
                if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
                        ++ct_cg_nocopy;
                        ((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
BLK_NOCOPY;
                }

For correct tests all the BLK_NOCOPY's that are set in cgaccount() can
later be seen in mapacct_ufs2(); for bad tests many of the BLK_NOCOPY's
have changed to 0.

It looks like the rm command removing the previous snapshot somehow keeps
running "in the background", simultaneously with expunge_ufs2(), and
changes some of the BLK_NOCOPY's to zero. So this may be a buffer
management problem that exists only on gjournaled partitions; maybe
getblk/readblock as used in indiracct_ufs2() is not compatible with
gjournal in the special case of creating or removing a snapshot. A hint
in this direction is the fact that the first test after cleaning the
partition with
        umount /home; fsck -y /home; mount /home
always succeeds. The following modified test procedure never fails:

  for i in 1 2 3 4 5 6 7 8; do
     echo "starting snaptest $i" >/dev/console
     mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
     echo $(ls -ils /home/.snap/fscktest) >/dev/console
     rm -f /home/.snap/fscktest /home
     umount /home
     mount /home
  done

Further evidence that the snapshot file is corrupt when the snapblklist
is shortened is the fact that the rm command sporadically panics in a
kernel routine that is known to be correct:

Unread portion of the kernel message buffer:
dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
panic: ffs_blkfree_cg: freeing free block
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
0xfffffe0857e3b1c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
panic() at panic+0x43/frame 0xfffffe0857e3b310
ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
--- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
0x7fffffffe988, rbp = 0x7fffffffea20 ---

Any hints solving this problem are welcome.
--

Andreas Longwitz

_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"

Re: ufs snapshot is sometimes corrupt on gjournaled partition

Andreas Longwitz
Oh, excuse me: there was a superfluous period at the end of the link;
it should be:

  https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html

And of course I did not want to remove /home, the correct rm command is

  rm -f /home/.snap/fscktest

Andreas Longwitz

Re: ufs snapshot is sometimes corrupt on gjournaled partition

Konstantin Belousov
In reply to this post by Andreas Longwitz
On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
> I try to understand the cause for the "free inode" problem described in
>   https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html.
>
> I have setup a test server (FreeBSD 10.3-STABLE #4 r317936) with a
> gjournaled partition for /home:
> mount -t ufs | grep /home -->
> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
> noatime, gjournal)
As the first thing to try: if you perform your tests on the raw
partition without gjournal, does the problem persist?

>
> My test creates one snapshot of /home (gets alway inode 4) and removes
> this snapshot:
>
>   for i in 1 2 3 4 5 6 7 8; do
>      echo "starting snaptest $i" >/dev/console
>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>      rm -f /home/.snap/fscktest /home
>   done
>
> I never have more than this one snapshot at work and during the test I
> never have any other
> user processes working on /home. A typical output looks like this:
>
> Jun 21 15:59:52 root: starting snaptest 1
> Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
> 21 Jun 15:59 /home/.snap/fscktest
> Jun 21 15:59:53 root: starting snaptest 2
> Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
> 21 Jun 15:59 /home/.snap/fscktest
> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
> Jun 21 15:59:54 root: starting snaptest 3
> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
> Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
> 21 Jun 15:59 /home/.snap/fscktest
> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
> Jun 21 15:59:56 root: starting snaptest 4
> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
> Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
> 21 Jun 15:59 /home/.snap/fscktest
> Jun 21 15:59:57 root: starting snaptest 5
> Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
> 21 Jun 15:59 /home/.snap/fscktest
> Jun 21 15:59:58 root: starting snaptest 6
> Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
> 21 Jun 15:59 /home/.snap/fscktest
> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
> Jun 21 15:59:59 root: starting snaptest 7
> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
> Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
> 21 Jun 16:00 /home/.snap/fscktest
> Jun 21 16:00:00 root: starting snaptest 8
> Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
> 21 Jun 16:00 /home/.snap/fscktest
>
> The "free inode /home/4 had NNN blocks" message during run of the mount
> command is output of ffs_valloc(), because ffs_load_inode() has load the
> disk inode 4 with a non zero i_blocks field. The corresponding "freeing
> inode /home/4 with NNN blocks" message during the previous rm command
> is output of my following diagnostic patch in function ffs_truncate():
>
> --- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
> +++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
> @@ -551,6 +551,9 @@
>                 DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>         else    /* sanity */
>                 DIP_SET(ip, i_blocks, 0);
> +       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
> +               printf("freeing inode %s/%lu with %ld blocks\n",
> +                       fs->fs_fsmnt, (u_long)ip->i_number,
> (long)DIP(ip, i_blocks));
>         ip->i_flag |= IN_CHANGE;
>  #ifdef QUOTA
>         (void) chkdq(ip, -blocksreleased, NOCRED, 0);
>
> The rm command can only free all the blocks of the snapshotfile (means
> i_blocks for inode 4 ends with zero) , if this file has the "correct" size:
>
> ls -ils /home/.snap/fscktest -->
> 4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
> /home/.snap/fscktest
>
> The size of the /home partition is given by
> diskinfo /dev/mirror/gmsvt7p10.journal -->
> /dev/mirror/gmsvt7p10.journal   512     90762954240     177271395
>
> So we have 2769865 full 32kB blocks with size 90631864320. During
> creating a snapshot a "last block" (32kB) is written at this offset
> ending at 90762969088. Finally the snapblklist is written with
> VOP_WRITE: "Write out the list of allocated blocks to the end of the
> snapshot". In all my correct tests the table snapblklist is 1152 bytes
> in size giving the correct size of the snapshot file : 90762970240. In
> this case the table snapblklist has 144 entries of length 8: one lenght
> entry and 143 logical block numbers recorded in mapacct_ufs2():
>
>           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>                    *ip->i_snapblklist++ = lblkno;
>
> The console output above shows three error situations with block
> counters 704, 2112 and 192. Dividing these values by 8 gives exactly the
> reduced size of the snapblocklist at the end of the snapshotfile, so in
> these cases the snapshotfile is corrupt.
I am not sure what you mean by 'match' there.  Could you explicitly
state what relation you observed between the snapshot size and the
leaked blocks of the freed snapshot inode?

>
> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>
>            ++ct_blkno_all;
>            if (blkno == 0)
>                    ++ct_blkno_0;
>            if (blkno == BLK_NOCOPY)
>                    ++ct_blkno_nocopy;
>            if (blkno == BLK_SNAP)
>                    ++ct_blkno_snap;
>            if (blkno == 0 || blkno == BLK_NOCOPY)
>                    continue;
>            if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>                    *ip->i_snapblklist++ = lblkno;
>                    ++ct_snapblklist;
>            }
>            if (blkno == BLK_SNAP)
>                    blkno = blkstofrags(fs, lblkno);
>            ++ct_blkfree;
>            ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>                vp->v_type, NULL);
>
> and for the 8 test runs shown above I can see these results using DTrace
> at probe expunge_ufs2:return (blkno_snap is always 0):
>
>  test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
>  -------------------------------------------------------------------
>  1 ok   2770545   353320  2416404      143         821     2416404
>  2 bad  2770545   587860  2181875      132         810     2416404
>  3 bad  2770545   956582  1813175      110         788     2416393
>  4 ok   2770545   353364  2416360      143         821     2416360
>  5 ok   2770545   353364  2416360      143         821     2416360
>  6 bad  2770545   418376  2351351      140         818     2416360
>  7 ok   2770545   353367  2416357      143         821     2416357
>  8 ok   2770545   353367  2416357      143         821     2416357
>
> For correct tests the sum of blkno_0 and blkno_nocopy is always the same
> (2769724), for bad tests especially the counter for blkno_nocopy is
> significant lower. In the test table I give one more column cg_nocopy
> for a counter I have added in cgaccount() to see how many entries are
> set to BLK_NOCOPY during copy of cylinder group maps:
>
>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>                         ++ct_cg_nocopy;
>                         DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>                 }
>                 ...
>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>                         ++ct_cg_nocopy;
>                         ((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
> BLK_NOCOPY;
>                 }
>
> For correct tests all the BLK_NOCOPY's which are set in cgaccount() can
> later be seen in mapacct_ufs2(), for bad tests many of the BLK_NOCOPY's
> have changed to 0.
>
> I looks like the rm command removing the previous snapshot in some way
> runs "in the background" simultan to expunge_ufs2() and changes some of
> the BLK_NOCOPY's to zero. So this may be a buffer management problem
> which only exists on gjourneled partitions, maybe getblk/readblock used
> in indiracct_ufs2() is not compatibel with gjournel in the special case
> of creating or removing a spapshot. A hint in this direction is the
> fact, that the first test after cleaning the partition with
>         umount /home; fsck -y /home; mount /home
> always succeeds. The following modified test procedure never fails:
>
>   for i in 1 2 3 4 5 6 7 8; do
>      echo "starting snaptest $i" >/dev/console
>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>      rm -f /home/.snap/fscktest /home
>      umount /home
>      mount /home
>   done
After the allocations of the required blocks for the snapshot inode
are finished, the filesystem is suspended. You can see the
call to vfs_write_suspend() in ffs_snapshot(), where the
suspension is enforced. As part of the suspension, all soft-update
workitems are flushed; this is done by vfs_write_suspend() calling
VFS_SYNC(MNT_SUSPEND).

UFS classifies writers into primary and secondary. Primary are mostly
the writes initiated by top-level syscall entries, like direct calls
to write(2) or metadata-changing operations such as mkdir(), create()
and so on. Secondary are writes performed when the system initiates
metadata updates during inactivation, quota updates, softdep background
processing and similar. Primary modifications are blocked outright on
suspension, while secondary writers are waited for in the mentioned
VFS_SYNC(MNT_SUSPEND) call.

If you can provide proof that some SU-related activity happens after
the suspension is established, that would be interesting to see.
It might be something different and much simpler, but I do not see
an obvious mistake in the code after reading your observations.

>
> Another proof that the snapshot file is corrupt when the snapblklist is
> shortend is the fact that the rm command sporadically panics in a kernel
> routine that is known to be correct:
>
> nread portion of the kernel message buffer:
> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
> panic: ffs_blkfree_cg: freeing free block
> cpuid = 1
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
> 0xfffffe0857e3b1c0
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
> panic() at panic+0x43/frame 0xfffffe0857e3b310
> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
> 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>
> Any hints solving this problem are welcome.

Re: ufs snapshot is sometimes corrupt on gjournaled partition

Kirk McKusick
> Date: Sat, 24 Jun 2017 18:30:17 +0300
> From: Konstantin Belousov <[hidden email]>
> To: Andreas Longwitz <[hidden email]>
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>> I try to understand the cause for the "free inode" problem described in
>>   https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>
>> I have setup a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>> gjournaled partition for /home:
>> mount -t ufs | grep /home -->
>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>> noatime, gjournal)
>
> As the first thing to try, if you perform your tests on the raw
> partition without gjournal, does the problem stay around ?

I concur that running your test without gjournal is the next test to try.
I think that your suspicion that there is a race condition with gjournal
is likely correct. And if that is true, then the problem will go away
when gjournal is taken out of the stack.

>> My test creates one snapshot of /home (gets alway inode 4) and removes
>> this snapshot:
>>
>>   for i in 1 2 3 4 5 6 7 8; do
>>      echo "starting snaptest $i" >/dev/console
>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>      rm -f /home/.snap/fscktest
>>   done
>>
>> I never have more than this one snapshot at work and during the test I
>> never have any other
>> user processes working on /home. A typical output looks like this:
>>
>> Jun 21 15:59:52 root: starting snaptest 1
>> Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:53 root: starting snaptest 2
>> Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>> Jun 21 15:59:54 root: starting snaptest 3
>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>> Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>> Jun 21 15:59:56 root: starting snaptest 4
>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>> Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:57 root: starting snaptest 5
>> Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:58 root: starting snaptest 6
>> Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
>> 21 Jun 15:59 /home/.snap/fscktest
>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>> Jun 21 15:59:59 root: starting snaptest 7
>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>> Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 16:00 /home/.snap/fscktest
>> Jun 21 16:00:00 root: starting snaptest 8
>> Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
>> 21 Jun 16:00 /home/.snap/fscktest
>>
>> The "free inode /home/4 had NNN blocks" message during run of the mount
>> command is output of ffs_valloc(), because ffs_load_inode() has load the
>> disk inode 4 with a non zero i_blocks field. The corresponding "freeing
>> inode /home/4 with NNN blocks" message during the previous rm command
>> is output of my following diagnostic patch in function ffs_truncate():
>>
>> --- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
>> @@ -551,6 +551,9 @@
>>                 DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>         else    /* sanity */
>>                 DIP_SET(ip, i_blocks, 0);
>> +       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>> +               printf("freeing inode %s/%lu with %ld blocks\n",
>> +                       fs->fs_fsmnt, (u_long)ip->i_number,
>> (long)DIP(ip, i_blocks));
>>         ip->i_flag |= IN_CHANGE;
>>  #ifdef QUOTA
>>         (void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>
>> The rm command can only free all the blocks of the snapshotfile (means
>> i_blocks for inode 4 ends with zero) , if this file has the "correct" size:
>>
>> ls -ils /home/.snap/fscktest -->
>> 4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
>> /home/.snap/fscktest
>>
>> The size of the /home partition is given by
>> diskinfo /dev/mirror/gmsvt7p10.journal -->
>> /dev/mirror/gmsvt7p10.journal   512     90762954240     177271395
>>
>> So we have 2769865 full 32kB blocks with size 90631864320. During
>> creating a snapshot a "last block" (32kB) is written at this offset
>> ending at 90762969088. Finally the snapblklist is written with
>> VOP_WRITE: "Write out the list of allocated blocks to the end of the
>> snapshot". In all my correct tests the table snapblklist is 1152 bytes
>> in size giving the correct size of the snapshot file : 90762970240. In
>> this case the table snapblklist has 144 entries of length 8: one lenght
>> entry and 143 logical block numbers recorded in mapacct_ufs2():
>>
>>           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>>                    *ip->i_snapblklist++ = lblkno;
>>
>> The console output above shows three error situations with block
>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly the
>> reduced size of the snapblocklist at the end of the snapshotfile, so in
>> these cases the snapshotfile is corrupt.
>
> I am not sure what do you mean by 'match' there.  Could you explicitely
> mention what relations between snapshot size and leaked blocks of the
> free snapshot inode did you noted ?

I too am confused here. Are you saying, for example, that 192 / 8 == 24
and that the snapblklist is short by 24 blocks? Because from the table
below, it appears to be short by only 3 blocks.

>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>
>>            ++ct_blkno_all;
>>            if (blkno == 0)
>>                    ++ct_blkno_0;
>>            if (blkno == BLK_NOCOPY)
>>                    ++ct_blkno_nocopy;
>>            if (blkno == BLK_SNAP)
>>                    ++ct_blkno_snap;
>>            if (blkno == 0 || blkno == BLK_NOCOPY)
>>                    continue;
>>            if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>>                    *ip->i_snapblklist++ = lblkno;
>>                    ++ct_snapblklist;
>>            }
>>            if (blkno == BLK_SNAP)
>>                    blkno = blkstofrags(fs, lblkno);
>>            ++ct_blkfree;
>>            ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>>                vp->v_type, NULL);
>>
>> and for the 8 test runs shown above I can see these results using DTrace
>> at probe expunge_ufs2:return (blkno_snap is always 0):
>>
>>  test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
>>  -------------------------------------------------------------------
>>  1 ok   2770545   353320  2416404      143         821     2416404
>>  2 bad  2770545   587860  2181875      132         810     2416404
>>  3 bad  2770545   956582  1813175      110         788     2416393
>>  4 ok   2770545   353364  2416360      143         821     2416360
>>  5 ok   2770545   353364  2416360      143         821     2416360
>>  6 bad  2770545   418376  2351351      140         818     2416360
>>  7 ok   2770545   353367  2416357      143         821     2416357
>>  8 ok   2770545   353367  2416357      143         821     2416357
>>
>> For correct tests the sum of blkno_0 and blkno_nocopy is always the same
>> (2769724), for bad tests especially the counter for blkno_nocopy is
>> significant lower. In the test table I give one more column cg_nocopy
>> for a counter I have added in cgaccount() to see how many entries are
>> set to BLK_NOCOPY during copy of cylinder group maps:
>>
>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>                         ++ct_cg_nocopy;
>>                         DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>>                 }
>>                 ...
>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>                         ++ct_cg_nocopy;
>>                         ((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
>> BLK_NOCOPY;
>>                 }
>>
>> For correct tests all the BLK_NOCOPY's which are set in cgaccount() can
>> later be seen in mapacct_ufs2(), for bad tests many of the BLK_NOCOPY's
>> have changed to 0.
>>
>> I looks like the rm command removing the previous snapshot in some way
>> runs "in the background" simultan to expunge_ufs2() and changes some of
>> the BLK_NOCOPY's to zero. So this may be a buffer management problem
>> which only exists on gjourneled partitions, maybe getblk/readblock used
>> in indiracct_ufs2() is not compatibel with gjournel in the special case
>> of creating or removing a spapshot. A hint in this direction is the
>> fact, that the first test after cleaning the partition with
>>         umount /home; fsck -y /home; mount /home
>> always succeeds. The following modified test procedure never fails:
>>
>>   for i in 1 2 3 4 5 6 7 8; do
>>      echo "starting snaptest $i" >/dev/console
>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>      rm -f /home/.snap/fscktest /home
>>      umount /home
>>      mount /home
>>   done
>
> After the allocations of required blocks for the snapshot inode
> are finished, the filesystem is suspended. You can see the
> call to vfs_write_suspend() in the ffs_snapshot() where the
> suspension is enforced. As part of the suspension, all soft-update
> workitems are flushed, this is done by vfs_write_suspend() calling
> VFS_SYNC(MNT_SUSPEND).
>
> UFS classifies writers into primary and secondary. Primary are mostly
> the writes initiated by the top-level syscall entries, like direct
> calls to write(2) or metadata-changing ops mkdir(), create() and so on.
> Secondary are writes performed when system initiates metadata updates
> during inactivation, quota updates, softdep background processing and
> similar. Primary modifications are blocked outright on suspension, while
> secondary are waited to finish in the mentioned VFS_SYNC(MNT_SUSPEND)
> call.
>
> If you can provide a proof that some SU-related activity happens after the
> suspension is established, this would be interesting to see.
> Might be it is something different and much simpler, but I do not see
> an obvious mistake in the code, after reading your observations.

The mount information at the top shows:

/dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)

Thus no soft updates are being used. We are running on just a basic UFS
filesystem. So, soft update processing has no relevance here.

>> Another proof that the snapshot file is corrupt when the snapblklist is
>> shortened is the fact that the rm command sporadically panics in a kernel
>> routine that is known to be correct:
>>
>> Unread portion of the kernel message buffer:
>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>> panic: ffs_blkfree_cg: freeing free block
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>> 0xfffffe0857e3b1c0
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
>> 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>
>> Any hints solving this problem are welcome.

Per the suggestion at the top, I recommend trying your test without
gjournal present to see if the bug goes away. If that is true, then
we can focus our attention on a possible race during rm in the gjournal
code.

        Kirk McKusick
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"

Re: ufs snapshot is sometimes corrupt on gjourneled partition

Andreas Longwitz
Kirk McKusick wrote:

>> Date: Sat, 24 Jun 2017 18:30:17 +0300
>> From: Konstantin Belousov <[hidden email]>
>> To: Andreas Longwitz <[hidden email]>
>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>
>> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>>> I try to understand the cause for the "free inode" problem described in
>>>   https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>>
>>> I have setup a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>>> gjournaled partition for /home:
>>> mount -t ufs | grep /home -->
>>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>>> noatime, gjournal)
>> As the first thing to try, if you perform your tests on the raw
>> partition without gjournal, does the problem stay around ?
>
> I concur that running your test without gjournal is the next test to try.
> I think that your suspicion that there is a race condition with gjournal
> is likely correct. And if that is true, then the problem will go away
> when gjournal is taken out of the stack.

The problem only occurs when gjournal is active (tunefs -J), independently
of the three mount options I have tested: async, noasync, sync.
For my tests I prefer the async mount option, as recommended in gjournal(8).
Using soft updates instead of gjournal is fine in all cases.

>>> My test creates one snapshot of /home (gets alway inode 4) and removes
>>> this snapshot:
>>>
>>>   for i in 1 2 3 4 5 6 7 8; do
>>>      echo "starting snaptest $i" >/dev/console
>>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>>      rm -f /home/.snap/fscktest
>>>   done
>>>
>>> I never have more than this one snapshot at work and during the test I
>>> never have any other
>>> user processes working on /home. A typical output looks like this:
>>>
>>> Jun 21 15:59:52 root: starting snaptest 1
>>> Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
>>> 21 Jun 15:59 /home/.snap/fscktest
>>> Jun 21 15:59:53 root: starting snaptest 2
>>> Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
>>> 21 Jun 15:59 /home/.snap/fscktest
>>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>>> Jun 21 15:59:54 root: starting snaptest 3
>>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>>> Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
>>> 21 Jun 15:59 /home/.snap/fscktest
>>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>>> Jun 21 15:59:56 root: starting snaptest 4
>>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>>> Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
>>> 21 Jun 15:59 /home/.snap/fscktest
>>> Jun 21 15:59:57 root: starting snaptest 5
>>> Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
>>> 21 Jun 15:59 /home/.snap/fscktest
>>> Jun 21 15:59:58 root: starting snaptest 6
>>> Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
>>> 21 Jun 15:59 /home/.snap/fscktest
>>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>>> Jun 21 15:59:59 root: starting snaptest 7
>>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>>> Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
>>> 21 Jun 16:00 /home/.snap/fscktest
>>> Jun 21 16:00:00 root: starting snaptest 8
>>> Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
>>> 21 Jun 16:00 /home/.snap/fscktest
>>>
>>> The "free inode /home/4 had NNN blocks" message during the run of the mount
>>> command is output of ffs_valloc(), because ffs_load_inode() has loaded the
>>> disk inode 4 with a non-zero i_blocks field. The corresponding "freeing
>>> inode /home/4 with NNN blocks" message during the previous rm command
>>> is output of my following diagnostic patch in function ffs_truncate():
>>>
>>> --- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
>>> +++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
>>> @@ -551,6 +551,9 @@
>>>                 DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>>         else    /* sanity */
>>>                 DIP_SET(ip, i_blocks, 0);
>>> +       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>>> +               printf("freeing inode %s/%lu with %ld blocks\n",
>>> +                       fs->fs_fsmnt, (u_long)ip->i_number,
>>> (long)DIP(ip, i_blocks));
>>>         ip->i_flag |= IN_CHANGE;
>>>  #ifdef QUOTA
>>>         (void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>>
>>> The rm command can only free all the blocks of the snapshotfile (means
>>> i_blocks for inode 4 ends with zero) , if this file has the "correct" size:
>>>
>>> ls -ils /home/.snap/fscktest -->
>>> 4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
>>> /home/.snap/fscktest
>>>
>>> The size of the /home partition is given by
>>> diskinfo /dev/mirror/gmsvt7p10.journal -->
>>> /dev/mirror/gmsvt7p10.journal   512     90762954240     177271395
>>>
>>> So we have 2769865 full 32kB blocks with size 90631864320. During
>>> creating a snapshot a "last block" (32kB) is written at this offset
>>> ending at 90762969088. Finally the snapblklist is written with
>>> VOP_WRITE: "Write out the list of allocated blocks to the end of the
>>> snapshot". In all my correct tests the table snapblklist is 1152 bytes
>>> in size giving the correct size of the snapshot file : 90762970240. In
>>> this case the table snapblklist has 144 entries of length 8: one length
>>> entry and 143 logical block numbers recorded in mapacct_ufs2():
>>>
>>>           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>>>                    *ip->i_snapblklist++ = lblkno;
>>>
>>> The console output above shows three error situations with block
>>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly the
>>> reduced size of the snapblocklist at the end of the snapshotfile, so in
>>> these cases the snapshotfile is corrupt.
>> I am not sure what do you mean by 'match' there.  Could you explicitely
>> mention what relations between snapshot size and leaked blocks of the
>> free snapshot inode did you noted ?
>
> I too am confused here. Are you saying for example that 192 / 8 == 24
> and that the snapblocklist is short by 24 blocks? Because from the table
> below, it appears to be short by only 3 blocks.

In this example the snapblocklist is short by 24 bytes (= 3 table
entries) and 24/8 = 3 calls to ffs_blkfree are missing.

Ok, I give one more table to explain a little better:

The meaning of the columns:
  #1 number of "free inode blocks" in the kernel message
  #2 blocks from #1 divided by 8 (= fs_frag)
  #3 size of the snapblklist written at the end of the snapshot file
  #4 ls -s of the snapshot file: 907629...
  #5 entries in the table snapblklist; we have #5 = #3 / 8, because each
     entry in the table is of type ufs2_daddr_t
  #6 counter ct_blkfree, that is the number of calls to ffs_blkfree().
  #7 missing table entries in snapblklist and calls of ffs_blkfree().

test     #1   #2    #3        #4    #5   #6  #7
-----------------------------------------------
1 ok      0    0  1152   ..70240   144  821   0
2 bad   704   88  1064   ..70152   133  810  11
3 bad  2112  264   888   ..69976   111  788  33
6 bad   192   24  1128   ..70216   141  818   3

The difference between #5 and #6 is due to the variable acctit, which is set
only for lblkno != -1. In all my tests (ok and bad) the function
mapacct_ufs2() is called 680 times; the first calls look like this
(diffblk = lastblk - oldblk):

 mapacct_ufs2:entry   /home, inum=4, diffblkp=0xc, lblkno=0
 mapacct_ufs2:entry   /home, inum=4, diffblkp=0x3, lblkno=-1
 mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=12
 mapacct_ufs2:entry   /home, inum=4, diffblkp=0x2a4, lblkno=-1
 mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=4108
 mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=8204
 mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=12300
 ....

There are exactly two calls with lblkno=-1, together giving 0x2a7 - 2 = 677
runs of the for loop in which #5 is skipped.

>>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>>
>>>            ++ct_blkno_all;
>>>            if (blkno == 0)
>>>                    ++ct_blkno_0;
>>>            if (blkno == BLK_NOCOPY)
>>>                    ++ct_blkno_nocopy;
>>>            if (blkno == BLK_SNAP)
>>>                    ++ct_blkno_snap;
>>>            if (blkno == 0 || blkno == BLK_NOCOPY)
>>>                    continue;
>>>            if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>>>                    *ip->i_snapblklist++ = lblkno;
>>>                    ++ct_snapblklist;
>>>            }
>>>            if (blkno == BLK_SNAP)
>>>                    blkno = blkstofrags(fs, lblkno);
>>>            ++ct_blkfree;
>>>            ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>>>                vp->v_type, NULL);
>>>
>>> and for the 8 test runs shown above I can see these results using DTrace
>>> at probe expunge_ufs2:return (blkno_snap is always 0):
>>>
>>>  test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
>>>  -------------------------------------------------------------------
>>>  1 ok   2770545   353320  2416404      143         821     2416404
>>>  2 bad  2770545   587860  2181875      132         810     2416404
>>>  3 bad  2770545   956582  1813175      110         788     2416393
>>>  4 ok   2770545   353364  2416360      143         821     2416360
>>>  5 ok   2770545   353364  2416360      143         821     2416360
>>>  6 bad  2770545   418376  2351351      140         818     2416360
>>>  7 ok   2770545   353367  2416357      143         821     2416357
>>>  8 ok   2770545   353367  2416357      143         821     2416357
>>>
>>> For correct tests the sum of blkno_0 and blkno_nocopy is always the same
>>> (2769724); for bad tests the counter for blkno_nocopy in particular is
>>> significantly lower. In the test table I give one more column cg_nocopy
>>> for a counter I have added in cgaccount() to see how many entries are
>>> set to BLK_NOCOPY during copy of cylinder group maps:
>>>
>>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>>                         ++ct_cg_nocopy;
>>>                         DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>>>                 }
>>>                 ...
>>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>>                         ++ct_cg_nocopy;
>>>                         ((ufs2_daddr_t *)(ibp->b_data))[indiroff] =
>>> BLK_NOCOPY;
>>>                 }
>>>
>>> For correct tests all the BLK_NOCOPY's which are set in cgaccount() can
>>> later be seen in mapacct_ufs2(), for bad tests many of the BLK_NOCOPY's
>>> have changed to 0.
>>>
>>> It looks like the rm command removing the previous snapshot somehow
>>> runs "in the background" simultaneously with expunge_ufs2() and changes
>>> some of the BLK_NOCOPY's to zero. So this may be a buffer management
>>> problem which only exists on gjournaled partitions; maybe getblk/readblock
>>> as used in indiracct_ufs2() is not compatible with gjournal in the special
>>> case of creating or removing a snapshot. A hint in this direction is the
>>> fact that the first test after cleaning the partition with
>>>         umount /home; fsck -y /home; mount /home
>>> always succeeds. The following modified test procedure never fails:
>>>
>>>   for i in 1 2 3 4 5 6 7 8; do
>>>      echo "starting snaptest $i" >/dev/console
>>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>>      rm -f /home/.snap/fscktest /home
>>>      umount /home
>>>      mount /home
>>>   done
>> After the allocations of required blocks for the snapshot inode
>> are finished, the filesystem is suspended. You can see the
>> call to vfs_write_suspend() in the ffs_snapshot() where the
>> suspension is enforced. As part of the suspension, all soft-update
>> workitems are flushed, this is done by vfs_write_suspend() calling
>> VFS_SYNC(MNT_SUSPEND).
>>
>> UFS classifies writers into primary and secondary. Primary are mostly
>> the writes initiated by the top-level syscall entries, like direct
>> calls to write(2) or metadata-changing ops mkdir(), create() and so on.
>> Secondary are writes performed when system initiates metadata updates
>> during inactivation, quota updates, softdep background processing and
>> similar. Primary modifications are blocked outright on suspension, while
>> secondary are waited to finish in the mentioned VFS_SYNC(MNT_SUSPEND)
>> call.
>>
>> If you can provide a proof that some SU-related activity happens after the
>> suspension is established, this would be interesting to see.
>> Might be it is something different and much simpler, but I do not see
>> an obvious mistake in the code, after reading your observations.
>
> The mount information at the top shows:
>
> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)
>
> Thus no soft updates are being used. We are running on just a basic UFS
> filesystem. So, soft update processing has no relevance here.
>
>>> Another proof that the snapshot file is corrupt when the snapblklist is
>>> shortened is the fact that the rm command sporadically panics in a kernel
>>> routine that is known to be correct:
>>>
>>> Unread portion of the kernel message buffer:
>>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>>> panic: ffs_blkfree_cg: freeing free block
>>> cpuid = 1
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>>> 0xfffffe0857e3b1c0
>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
>>> 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>>
>>> Any hints solving this problem are welcome.
>
> Per the suggestion at the top, I recommend trying your test without
> gjournal present to see if the bug goes away. If that is true, then
> we can focus our attention on a possible race during rm in the gjournal
> code.
>
> Kirk McKusick
I have done as explained above. Here are some more hints that may help in
debugging this.

1. The output of dumpfs for the partition is:

magic   19540119 (UFS2) time    Sun Jun 25 03:46:09 2017
superblock location     65536   id      [ 561bcaa3 624b156b ]
ncg     139     size    22158924        blocks  21459451
bsize   32768   shift   15      mask    0xffff8000
fsize   4096    shift   12      mask    0xfffff000
frag    8       shift   3       fsbtodb 3
minfree 8%      optim   time    symlinklen 120
maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
nbfree  2413023 ndir    229     nifree  11153267        nffree  274
bpg     20035   fpg     160280  ipg     80256   unrefs  0
nindir  4096    inopb   128     maxfilesize     2252349704110079
sbsize  4096    cgsize  32768   csaddr  5056    cssize  4096
sblkno  24      cblkno  32      iblkno  40      dblkno  5056
cgrotor 123     fmod    0       ronly   0       clean   0
metaspace 6408  avgfpdir 64     avgfilesize 16384
flags   gjournal
fsmnt   /home
volname         swuid   0       providersize    22158924

2. I am quite sure that the process g_journal_switcher does not break
things during the run of "mount -o snapshot". I have several DTrace outputs
of the problem where g_journal_switcher sleeps the whole time.

3. The probe expunge_ufs2:return is the first of several DTrace probes
that shows a difference between good and bad runs.

4. The problem has occurred for many years on several production servers
and partitions where only one snapshot is taken every day, but not on all
of them and not very often. On newer servers the "free inode" messages
increase. On my test server I have a "frozen" partition /home that shows
the problem within fewer than ten runs every time. If I run DTrace with a
lot of probes, the problem sometimes becomes a "Heisenbug".


Andreas Longwitz



Re: ufs snapshot is sometimes corrupt on gjourneled partition

Andreas Longwitz
Hello,

I would like to discuss the following code snippet from the function
indiracct_ufs2() in the source file ffs_snapshot.c:

    /*
     * We have to expand bread here since it will deadlock looking
     * up the block number for any blocks that are not in the cache.
     */
    bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
    bp->b_blkno = fsbtodb(fs, blkno);
    if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
        (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
            brelse(bp);
            return (error);
    }

In all my tests (all mount types, good or bad result) the flags B_DONE
and B_DELWRI in bp->b_flags were always zero, so the data buffer provided
by getblk() at bp->b_data is always overwritten with the data from the
explicit I/O performed by readblock(). By the way, the flag B_CACHE was
always set. I have saved the buffer given by getblk() and compared this
buffer with the data read by readblock(). Without gjournal there is
never a difference. On a gjournaled partition things are more
complicated. Taking a snapshot in my example environment needs 678
calls of getblk(). In the "good case" all I/O's performed by readblock()
return the same data as getblk(); in the "bad case" there are some
differences between the data in the buffer cache seen by getblk() and the
data on disk seen by readblock(). Mostly two or three of the 678 blocks
differ. In these cases the block read by readblock() is always
zero, while the block read by getblk() contains a lot of BLK_NOCOPY values,
sometimes 4096 (a full block).

There is one other flag in bp->b_flags given by getblk() that is of
interest: B_CLUSTEROK. This flag is always in sync with the flag
BV_SCANNED in bp->b_vflags. In the above code snippet I can check this
flag too before going to disk:

   if ((bp->b_flags & (B_DONE | B_DELWRI | B_CLUSTEROK)) == 0 && ...

With this patch the problem is gone in all my tests. The resulting
snapblklist written to the end of the snapshot file is always the same.

I am not familiar with the buffer and memory management of FreeBSD, so I
cannot explain what is going on. But I see that in cgaccount() a lot of
BLK_NOCOPY values are set, and in expunge_ufs2() these values should be
seen again. It seems they are in the buffer cache but sometimes not on
disk. With the patch given above, in my test example the data is read
from the buffer cache instead of from disk 141 times.

All sync operations in ffs_snapshot() use the parameter MNT_WAIT; the
gjournal switcher runs all the time but syncs the same partition with
MNT_NOWAIT (g_journal.c: vfs_msync() followed by VFS_SYNC()). I do not know
whether this can lead to a race condition between gjournal and snapshot.


>> Kirk McKusick wrote:
>>> Date: Sat, 24 Jun 2017 18:30:17 +0300
>>> From: Konstantin Belousov <[hidden email]>
>>> To: Andreas Longwitz <[hidden email]>
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> On Fri, Jun 23, 2017 at 11:14:43AM +0200, Andreas Longwitz wrote:
>>>> I try to understand the cause for the "free inode" problem described in
>>>>   https://lists.freebsd.org/pipermail/freebsd-fs/2013-November/018610.html
>>>>
>>>> I have setup a test server (FreeBSD 10.3-STABLE #4 r317936) with a
>>>> gjournaled partition for /home:
>>>> mount -t ufs | grep /home -->
>>>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local,
>>>> noatime, gjournal)
>>> As the first thing to try, if you perform your tests on the raw
>>> partition without gjournal, does the problem stay around ?
>> I concur that running your test without gjournal is the next test to try.
>> I think that your suspicion that there is a race condition with gjournal
>> is likely correct. And if that is true, then the problem will go away
>> when gjournal is taken out of the stack.
>
> The problem only occurs when gjournal is active (tunefs -J), independently
> of the three mount options I have tested: async, noasync, sync.
> For my tests I prefer the async mount option, as recommended in gjournal(8).
> Using soft updates instead of gjournal is fine in all cases.
>
>>>> My test creates one snapshot of /home (gets alway inode 4) and removes
>>>> this snapshot:
>>>>
>>>>   for i in 1 2 3 4 5 6 7 8; do
>>>>      echo "starting snaptest $i" >/dev/console
>>>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>>>      rm -f /home/.snap/fscktest
>>>>   done
>>>>
>>>> I never have more than this one snapshot at work and during the test I
>>>> never have any other
>>>> user processes working on /home. A typical output looks like this:
>>>>
>>>> Jun 21 15:59:52 root: starting snaptest 1
>>>> Jun 21 15:59:52 root: 4 26592 -r--------  1 root  operator  90762970240
>>>> 21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:53 root: starting snaptest 2
>>>> Jun 21 15:59:53 root: 4 26592 -r--------  1 root  operator  90762970152
>>>> 21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:54 kernel: freeing inode /home/4 with 704 blocks
>>>> Jun 21 15:59:54 root: starting snaptest 3
>>>> Jun 21 15:59:54 kernel: free inode /home/4 had 704 blocks
>>>> Jun 21 15:59:54 root: 4 26592 -r--------  1 root  operator  90762969976
>>>> 21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:56 kernel: freeing inode /home/4 with 2112 blocks
>>>> Jun 21 15:59:56 root: starting snaptest 4
>>>> Jun 21 15:59:56 kernel: free inode /home/4 had 2112 blocks
>>>> Jun 21 15:59:56 root: 4 26592 -r--------  1 root  operator  90762970240
>>>> 21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:57 root: starting snaptest 5
>>>> Jun 21 15:59:57 root: 4 26592 -r--------  1 root  operator  90762970240
>>>> 21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:58 root: starting snaptest 6
>>>> Jun 21 15:59:58 root: 4 26592 -r--------  1 root  operator  90762970216
>>>> 21 Jun 15:59 /home/.snap/fscktest
>>>> Jun 21 15:59:59 kernel: freeing inode /home/4 with 192 blocks
>>>> Jun 21 15:59:59 root: starting snaptest 7
>>>> Jun 21 15:59:59 kernel: free inode /home/4 had 192 blocks
>>>> Jun 21 15:59:59 root: 4 26592 -r--------  1 root  operator  90762970240
>>>> 21 Jun 16:00 /home/.snap/fscktest
>>>> Jun 21 16:00:00 root: starting snaptest 8
>>>> Jun 21 16:00:00 root: 4 26592 -r--------  1 root  operator  90762970240
>>>> 21 Jun 16:00 /home/.snap/fscktest
>>>>
>>>> The "free inode /home/4 had NNN blocks" message during the run of the mount
>>>> command is output of ffs_valloc(), because ffs_load_inode() has loaded the
>>>> disk inode 4 with a non-zero i_blocks field. The corresponding "freeing
>>>> inode /home/4 with NNN blocks" message during the previous rm command
>>>> is output of my following diagnostic patch in function ffs_truncate():
>>>>
>>>> --- ffs_inode.c.1st     2016-06-08 17:25:21.000000000 +0200
>>>> +++ ffs_inode.c 2017-06-19 10:02:07.145360000 +0200
>>>> @@ -551,6 +551,9 @@
>>>>                 DIP_SET(ip, i_blocks, DIP(ip, i_blocks) - blocksreleased);
>>>>         else    /* sanity */
>>>>                 DIP_SET(ip, i_blocks, 0);
>>>> +       if (bootverbose == 2 && DIP(ip, i_blocks) > 0)
>>>> +               printf("freeing inode %s/%lu with %ld blocks\n",
>>>> +                       fs->fs_fsmnt, (u_long)ip->i_number,
>>>> (long)DIP(ip, i_blocks));
>>>>         ip->i_flag |= IN_CHANGE;
>>>>  #ifdef QUOTA
>>>>         (void) chkdq(ip, -blocksreleased, NOCRED, 0);
>>>>
>>>> The rm command can only free all the blocks of the snapshotfile (means
>>>> i_blocks for inode 4 ends with zero) , if this file has the "correct" size:
>>>>
>>>> ls -ils /home/.snap/fscktest -->
>>>> 4 53184 -r--------  1 root  operator  90762970240 Jun 17 06:15
>>>> /home/.snap/fscktest
>>>>
>>>> The size of the /home partition is given by
>>>> diskinfo /dev/mirror/gmsvt7p10.journal -->
>>>> /dev/mirror/gmsvt7p10.journal   512     90762954240     177271395
>>>>
>>>> So we have 2769865 full 32kB blocks with size 90631864320. During
>>>> creating a snapshot a "last block" (32kB) is written at this offset
>>>> ending at 90762969088. Finally the snapblklist is written with
>>>> VOP_WRITE: "Write out the list of allocated blocks to the end of the
>>>> snapshot". In all my correct tests the table snapblklist is 1152 bytes
>>>> in size giving the correct size of the snapshot file : 90762970240. In
>>>> this case the table snapblklist has 144 entries of length 8: one length
>>>> entry and 143 logical block numbers recorded in mapacct_ufs2():
>>>>
>>>>           if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP)
>>>>                    *ip->i_snapblklist++ = lblkno;
>>>>
>>>> The console output above shows three error situations with block
>>>> counters 704, 2112 and 192. Dividing these values by 8 gives exactly the
>>>> reduced size of the snapblocklist at the end of the snapshotfile, so in
>>>> these cases the snapshotfile is corrupt.
>>> I am not sure what do you mean by 'match' there.  Could you explicitely
>>> mention what relations between snapshot size and leaked blocks of the
>>> free snapshot inode did you noted ?
>> I too am confused here. Are you saying for example that 192 / 8 == 24
>> and that the snapblocklist is short by 24 blocks? Because from the table
>> below, it appears to be short by only 3 blocks.
>
> In this example the snapblocklist is short by 24 bytes (= 3 table
> entries) and 24/8 = 3 calls to ffs_blkfree are missing.
>
> Ok, I give one more table to explain a little better:
>
> The meaning of the columns:
>   #1 number of "free inode blocks" in the kernel message
>   #2 blocks from #1 divided by 8 (= fs_frag)
>   #3 size of the snapblklist written at the end of the snapshot file
>   #4 ls -s of the snapshot file: 907629...
>   #5 entries in the table snapblklist; we have #5 = #3 / 8, because each
>      entry in the table is of type ufs2_daddr_t
>   #6 counter ct_blkfree, that is the number of calls to ffs_blkfree().
>   #7 missing table entries in snapblklist and calls of ffs_blkfree().
>
> test     #1   #2    #3        #4    #5   #6  #7
> -----------------------------------------------
> 1 ok      0    0  1152   ..70240   144  821   0
> 2 bad   704   88  1064   ..70152   133  810  11
> 3 bad  2112  264   888   ..69976   111  788  33
> 6 bad   192   24  1128   ..70216   141  818   3
>
> The difference between #5 and #6 is due to the variable acctit, which is set
> only for lblkno != -1. In all my tests (ok and bad) the function
> mapacct_ufs2() is called 680 times; the first calls look like this
> (diffblk = lastblk - oldblk):
>
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0xc, lblkno=0
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0x3, lblkno=-1
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=12
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0x2a4, lblkno=-1
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=4108
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=8204
>  mapacct_ufs2:entry   /home, inum=4, diffblkp=0x1000, lblkno=12300
>  ....
>
> There are exactly two calls with lblkno=-1, together giving 0x2a7 - 2 = 677
> runs of the for loop in which #5 is skipped.
>
>>>> I use a test kernel with some extra counters ct_* in mapacct_ufs2():
>>>>
>>>>            ++ct_blkno_all;
>>>>            if (blkno == 0)
>>>>                    ++ct_blkno_0;
>>>>            if (blkno == BLK_NOCOPY)
>>>>                    ++ct_blkno_nocopy;
>>>>            if (blkno == BLK_SNAP)
>>>>                    ++ct_blkno_snap;
>>>>            if (blkno == 0 || blkno == BLK_NOCOPY)
>>>>                    continue;
>>>>            if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) {
>>>>                    *ip->i_snapblklist++ = lblkno;
>>>>                    ++ct_snapblklist;
>>>>            }
>>>>            if (blkno == BLK_SNAP)
>>>>                    blkno = blkstofrags(fs, lblkno);
>>>>            ++ct_blkfree;
>>>>            ffs_blkfree(ip->i_ump, fs, vp, blkno, fs->fs_bsize, inum,
>>>>                vp->v_type, NULL);
>>>>
>>>> and for the 8 test runs shown above I can see these results using DTrace
>>>> at probe expunge_ufs2:return (blkno_snap is always 0):
>>>>
>>>>  test   blkno_all blkno_0 blkno_nocopy snapblklist blkfree cg_nocopy
>>>>  -------------------------------------------------------------------
>>>>  1 ok   2770545   353320  2416404      143         821     2416404
>>>>  2 bad  2770545   587860  2181875      132         810     2416404
>>>>  3 bad  2770545   956582  1813175      110         788     2416393
>>>>  4 ok   2770545   353364  2416360      143         821     2416360
>>>>  5 ok   2770545   353364  2416360      143         821     2416360
>>>>  6 bad  2770545   418376  2351351      140         818     2416360
>>>>  7 ok   2770545   353367  2416357      143         821     2416357
>>>>  8 ok   2770545   353367  2416357      143         821     2416357
>>>>
>>>> For correct tests the sum of blkno_0 and blkno_nocopy is always the same
>>>> (2769724); for bad tests the counter blkno_nocopy in particular is
>>>> significantly lower. In the test table I give one more column, cg_nocopy,
>>>> for a counter I added in cgaccount() to see how many entries are
>>>> set to BLK_NOCOPY during the copy of the cylinder group maps:
>>>>
>>>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>>>                         ++ct_cg_nocopy;
>>>>                         DIP_SET(ip, i_db[loc], BLK_NOCOPY);
>>>>                 }
>>>>                 ...
>>>>                 if (ffs_isblock(fs, cg_blksfree(cgp), loc)) {
>>>>                         ++ct_cg_nocopy;
>>>>                         ((ufs2_daddr_t *)(ibp->b_data))[indiroff] = BLK_NOCOPY;
>>>>                 }
>>>>
>>>> For correct tests all the BLK_NOCOPY's that are set in cgaccount() can
>>>> later be seen in mapacct_ufs2(); for bad tests many of the BLK_NOCOPY's
>>>> have changed to 0.
>>>>
>>>> It looks like the rm command removing the previous snapshot somehow runs
>>>> "in the background" simultaneously with expunge_ufs2() and changes some of
>>>> the BLK_NOCOPY's to zero. So this may be a buffer management problem
>>>> that only exists on gjournaled partitions; maybe getblk/readblock used
>>>> in indiracct_ufs2() is not compatible with gjournal in the special case
>>>> of creating or removing a snapshot. A hint in this direction is the
>>>> fact that the first test after cleaning the partition with
>>>>         umount /home; fsck -y /home; mount /home
>>>> always succeeds. The following modified test procedure never fails:
>>>>
>>>>   for i in 1 2 3 4 5 6 7 8; do
>>>>      echo "starting snaptest $i" >/dev/console
>>>>      mount -u -o snapshot -o noatime -o async /home/.snap/fscktest /home
>>>>      echo $(ls -ils /home/.snap/fscktest) >/dev/console
>>>>      rm -f /home/.snap/fscktest /home
>>>>      umount /home
>>>>      mount /home
>>>>   done
>>> After the allocations of required blocks for the snapshot inode
>>> are finished, the filesystem is suspended. You can see the
>>> call to vfs_write_suspend() in the ffs_snapshot() where the
>>> suspension is enforced. As part of the suspension, all soft-update
>>> workitems are flushed, this is done by vfs_write_suspend() calling
>>> VFS_SYNC(MNT_SUSPEND).
>>>
>>> UFS classifies writers into primary and secondary. Primary are mostly
>>> the writes initiated by the top-level syscall entries, like direct
>>> calls to write(2) or metadata-changing ops mkdir(), create() and so on.
>>> Secondary are writes performed when the system initiates metadata updates
>>> during inactivation, quota updates, softdep background processing and
>>> similar. Primary modifications are blocked outright on suspension, while
>>> secondary ones are waited for in the mentioned VFS_SYNC(MNT_SUSPEND)
>>> call.
>>>
>>> If you can provide a proof that some SU-related activity happens after the
>>> suspension is established, this would be interesting to see.
>>> It might be something different and much simpler, but I do not see
>>> an obvious mistake in the code after reading your observations.
>> The mount information at the top shows:
>>
>> /dev/mirror/gmsvt7p10.journal on /home (ufs, asynchronous, local, noatime, gjournal)
>>
>> Thus no soft updates are being used. We are running on just a basic UFS
>> filesystem. So, soft update processing has no relevance here.
>>
>>>> Another proof that the snapshot file is corrupt when the snapblklist is
>>>> shortened is the fact that the rm command sporadically panics in a kernel
>>>> routine that is known to be correct:
>>>>
>>>> Unread portion of the kernel message buffer:
>>>> dev = mirror/gmsvt7p10.journal, block = 19727560, fs = /home
>>>> panic: ffs_blkfree_cg: freeing free block
>>>> cpuid = 1
>>>> KDB: stack backtrace:
>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>>>> 0xfffffe0857e3b1c0
>>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0857e3b270
>>>> vpanic() at vpanic+0x126/frame 0xfffffe0857e3b2b0
>>>> panic() at panic+0x43/frame 0xfffffe0857e3b310
>>>> ffs_blkfree_cg() at ffs_blkfree_cg+0x5c6/frame 0xfffffe0857e3b3d0
>>>> ffs_blkfree() at ffs_blkfree+0x99/frame 0xfffffe0857e3b430
>>>> ffs_indirtrunc() at ffs_indirtrunc+0x474/frame 0xfffffe0857e3b510
>>>> ffs_indirtrunc() at ffs_indirtrunc+0x423/frame 0xfffffe0857e3b5f0
>>>> ffs_truncate() at ffs_truncate+0x10b4/frame 0xfffffe0857e3b7d0
>>>> ufs_inactive() at ufs_inactive+0x16b/frame 0xfffffe0857e3b810
>>>> VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xf7/frame 0xfffffe0857e3b840
>>>> vinactive() at vinactive+0xc6/frame 0xfffffe0857e3b890
>>>> vputx() at vputx+0x27a/frame 0xfffffe0857e3b8f0
>>>> kern_unlinkat() at kern_unlinkat+0x243/frame 0xfffffe0857e3bae0
>>>> amd64_syscall() at amd64_syscall+0x2c6/frame 0xfffffe0857e3bbf0
>>>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0857e3bbf0
>>>> --- syscall (10, FreeBSD ELF64, sys_unlink), rip = 0x80095425a, rsp =
>>>> 0x7fffffffe988, rbp = 0x7fffffffea20 ---
>>>>
>>>> Any hints solving this problem are welcome.
>> Per the suggestion at the top, I recommend trying your test without
>> gjournal present to see if the bug goes away. If that is true, then
>> we can focus our attention on a possible race during rm in the gjournal
>> code.
>>
>> Kirk McKusick
> I have done as explained above. Some more hints that may help with
> debugging this:
>
> 1. Output of dumpfs of the partition is
>
> magic   19540119 (UFS2) time    Sun Jun 25 03:46:09 2017
> superblock location     65536   id      [ 561bcaa3 624b156b ]
> ncg     139     size    22158924        blocks  21459451
> bsize   32768   shift   15      mask    0xffff8000
> fsize   4096    shift   12      mask    0xfffff000
> frag    8       shift   3       fsbtodb 3
> minfree 8%      optim   time    symlinklen 120
> maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
> nbfree  2413023 ndir    229     nifree  11153267        nffree  274
> bpg     20035   fpg     160280  ipg     80256   unrefs  0
> nindir  4096    inopb   128     maxfilesize     2252349704110079
> sbsize  4096    cgsize  32768   csaddr  5056    cssize  4096
> sblkno  24      cblkno  32      iblkno  40      dblkno  5056
> cgrotor 123     fmod    0       ronly   0       clean   0
> metaspace 6408  avgfpdir 64     avgfilesize 16384
> flags   gjournal
> fsmnt   /home
> volname         swuid   0       providersize    22158924
>
> 2. I am quite sure that the process g_journal_switcher does not break
> things during the run of "mount -o snapshot". I have several DTrace outputs
> of the problem where g_journal_switcher sleeps all the time.
>
> 3. The probe expunge_ufs2:return is the first of several DTrace probes
> that shows a difference between good and bad runs.
>
> 4. The problem has occurred for many years on several production servers and
> partitions where only one snapshot is taken every day, but not on all of them
> and not very often. On newer servers the "free inode" messages increase. On
> my test server I have a "frozen" partition /home that shows the problem
> within fewer than ten runs every time. If I run DTrace with a lot of
> probes, the problem sometimes becomes a "Heisenbug".
>

Dr. Andreas Longwitz

_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"

Re: ufs snapshot is sometimes corrupt on gjourneled partition

Kirk McKusick-2
I am copying Kostik on this email even though he is on freebsd-fs
as I would like him to review my answer below.

> Date: Sun, 16 Jul 2017 17:52:39 +0200
> From: Andreas Longwitz <[hidden email]>
> To: Kirk McKusick <[hidden email]>
> CC: [hidden email]
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
> Hello,
>
> I would like to discuss the following code snippet from the function
> indiracct_ufs2() in the source file ffs_snapshot.c:
>
>     /*
>      * We have to expand bread here since it will deadlock looking
>      * up the block number for any blocks that are not in the cache.
>      */
>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>     bp->b_blkno = fsbtodb(fs, blkno);
>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>             brelse(bp);
>             return (error);
>     }
>
> In all my tests (all mount types, good or bad result) the flags B_DONE
> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
> getblk() at bp->b_data is always overwritten with the data from the
> explicit I/O performed by readblock(). By the way the flag B_CACHE was
> always set.

The key observation here is that B_CACHE is set! That means that the data
in the block is valid and does not need to be read again. So the fix
is this:

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c (revision 321045)
+++ sys/ufs/ffs/ffs_snapshot.c (working copy)
@@ -1394,7 +1394,7 @@
  */
  bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
  bp->b_blkno = fsbtodb(fs, blkno);
- if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
+ if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
     (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
  brelse(bp);
  return (error);

Absent gjournal the extra read was unnecessary but harmless. The
buffer exists with B_DELWRI so we will not read it, or it is locked
while being written to disk, so we will wait for the write to complete,
after which it is safe to read it. With gjournal there is a window of
vulnerability between when the buffer is marked B_DELWRI and when it
is actually written to disk. This is because gjournal will say the
write has completed when it has not yet actually done so. Thus when
we read it, we get incorrect results. But until gjournal actually
completes the write, it will keep the buffer around (marked B_CACHE),
so as long as we check for B_CACHE to avoid the read (as we should
anyway to avoid an unnecessary read) we should always get consistent
results.

> I have saved the buffer given by getblk() and compared this buffer
> with the data read by readblock(). Without gjournal there is never
> a difference. Using a gjournaled partition, things are more
> sophisticated. Taking a snapshot in my example environment needs
> 678 calls to getblk(). In the "good case" all I/O's performed by
> readblock() gave the same data as getblk(); in the "bad case" there
> are some differences between the data in the buffer cache seen by
> getblk() and the data on disk seen by readblock(). Mostly two or
> three of the 678 blocks are different. In this case the block read
> by readblock() is always zero, while the block read by getblk() has a lot
> of BLK_NOCOPY values, sometimes 4096 (a full block).
>
> There is one other flag in bp->b_flags given by getblk that is of
> interest: B_CLUSTEROK. This flag is always in sync with the flag
> BV_SCANNED in bp->b_vflags. In the above code snippet I can check this
> flag too before going to disk:
>
>    if ((bp->b_flags & (B_DONE | B_DELWRI | B_CLUSTEROK)) == 0 && ...
>
> With this patch the problem is gone in all my tests. The resulting
> snapblklist written to the end of the snapfile is always the same.

It is merely coincidence that B_CLUSTEROK is set. But the fact that
skipping the read fixes your problem leads me to believe that checking
for B_CACHE will fix your problem.

> I am not familiar with the buffer and memory management of FreeBSD, so I
> cannot explain what is going on. But I see that in cgaccount() a lot of
> BLK_NOCOPY values are set, and in expunge_ufs2() these values should be
> seen again. It seems they are in the buffer cache but sometimes not on
> disk. The patch given above in my test example reads data from buffer
> cache instead of from disk 141 times.

The number of buffer flags has grown over time and their meaning has
subtly changed. Even I, who have been around since the beginning, cannot
claim to fully comprehend exactly what they mean. So, the fact that you
find them incomprehensible is unfortunately all too understandable.

> All sync operations in ffs_snapshot() use the parameter MNT_WAIT, the
> gjournal switcher runs all the time but syncs the same partition with
> MNT_NOWAIT (g_journal.c: vfs_msync followed by VFS_SYNC). I do not know
> if this can lead to a race condition between gjournal and snapshot.

I went down this line of thinking with your first email and did not come
up with anything that explained the problem. So I am glad that you have
found something that I do believe explains what is going on.

        Kirk McKusick

Re: ufs snapshot is sometimes corrupt on gjourneled partition

Konstantin Belousov
On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:

> I am copying Kostik on this email even though he is on freebsd-fs
> as I would like him to review my answer below.
>
> > Date: Sun, 16 Jul 2017 17:52:39 +0200
> > From: Andreas Longwitz <[hidden email]>
> > To: Kirk McKusick <[hidden email]>
> > CC: [hidden email]
> > Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
> >
> > Hello,
> >
> > I like to discuss the following code snippet from function
> > indiracct_ufs2() in source file ffs_snapshot.c:
> >
> >     /*
> >      * We have to expand bread here since it will deadlock looking
> >      * up the block number for any blocks that are not in the cache.
> >      */
> >     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
> >     bp->b_blkno = fsbtodb(fs, blkno);
> >     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> >         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
> >             brelse(bp);
> >             return (error);
> >     }
> >
> > In all my tests (all mount types, good or bad result) the flags B_DONE
> > and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
> > getblk() at bp->b_data is always overwritten with the data from the
> > explicit I/O performed by readblock(). By the way the flag B_CACHE was
> > always set.
>
> Key observation here that B_CACHE is set! That means that the data
> in the block is valid and does not need to be read again. So the fix
> is this:
>
> Index: sys/ufs/ffs/ffs_snapshot.c
> ===================================================================
> --- sys/ufs/ffs/ffs_snapshot.c (revision 321045)
> +++ sys/ufs/ffs/ffs_snapshot.c (working copy)
> @@ -1394,7 +1394,7 @@
>   */
>   bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>   bp->b_blkno = fsbtodb(fs, blkno);
> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>      (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>   brelse(bp);
>   return (error);
>
> Absent gjournal the extra read was unnecessary but harmless. The
> buffer exists with B_DELWRI so we will not read it, or it is locked
> while being written to disk, so we will wait for the write to complete,
> after which it is safe to read it. With gjournal there is a window of
> vulnerability between when the buffer is marked B_DELWRI and when it
> is actually written to disk. This is because gjournal will say the
> write has completed when it has not yet actually done so. Thus when
> we read it, we get incorrect results. But until gjournal actually
> completes the write, it will keep the buffer around (marked B_CACHE),
> so as long as we check for B_CACHE to avoid the read (as we should
> anyway to avoid an unnecessary read) we should always get consistent
> results.
I tend to disagree with the fix. What if the cached block is discarded
before we enter this code fragment?

I believe the bug is indeed clear after your analysis, but it is in
gjournal.  The problem is that the disk (the gjournal substrate layer, but
for UFS it is all disk) returns data different from what was written to the
volume.  After the write bio signalled completion, unless another write
was done after it, reads _must_ return the content of the write.

The right fix for gjournal is to perform 'store forwarding', i.e.
to return data from the journal until the journal is flushed to disk.

>
> > I have saved the buffer given by getblk() and compared this buffer
> > with the data read by readblock(). Without gjournal there is never
> > a difference. Using a gjournaled partition  things are more
> > sophisticated. Taking a snapshot in my example environment needs
> > 678 calls of getblk(). In the "good case" all I/O's performed by
> > readblock() gave the same data as getblk(), in the "bad case" there
> > are some differences between the data in buffer cache seen by
> > getblk() and the data on disk seen by readblock(). Mostly two or
> > three of the 678 blocks are different. In this case the block read
> > by readblock() is always zero, the block read by getblk() has a lot
> > of BLK_NOCOPY values, sometimes 4096 (a full block).
> >
> > There is one other flag in bp->b_flags given by getblk that is of
> > interest: B_CLUSTEROK. This flag is always in sync with the flag
> > BV_SCANNED in bp->b_vflags. In the above code snippet I can check this
> > flag too before going to disk:
> >
> >    if ((bp->b_flags & (B_DONE | B_DELWRI | B_CLUSTEROK)) == 0 && ...
> >
> > With this patch the problem has gone in all my tests. The resulting
> > snapblklist written to the end of the snapfile is always the same.
>
> It is merely coincidence that B_CLUSTEROK is set. But the fact that
> skipping the read fixes your problem leads me to believe that checking
> for B_CACHE will fix your problem.
>
> > I am not familiar with buffer and memory management of FreeBSD, so I can
> > not explain what is going on. But I see in cgaccount() a lot of
> > BLK_NOCOPY values are set and in expunge_ufs2() these values should be
> > seen again. It seems they are in the buffer cache but sometimes not on
> > disk. The patch given above in my test example reads data from buffer
> > cache instead of from disk 141 times.
>
> The number of buffer flags has grown over time and their meaning has
> subtly changed. Even I who have been around since the beginning cannot
> claim to fully comprehend exactly what they mean. So, the fact that you
> find them incomprehensible is unfortunately all too understandable.
>
> > All sync operations in ffs_snapshot() use the parameter MNT_WAIT, the
> > gjournal switcher runs all the time but syncs the same partition with
> > MNT_NOWAIT (g_journal.c: vfs_msync followed by VFS_SYNC). I do not know
> > if this can lead to a race condition between gjournal and snapshot.
>
> I went down this line of thinking with your first email and did not come
> up with anything that explained the problem. So I am glad that you have
> found something that I do believe explains what is going on.
>
> Kirk McKusick

Re: ufs snapshot is sometimes corrupt on gjourneled partition

Andreas Longwitz
> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>> I am copying Kostik on this email even though he is on freebsd-fs
>> as I would like him to review my answer below.
>>
>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>> From: Andreas Longwitz <[hidden email]>
>>> To: Kirk McKusick <[hidden email]>
>>> CC: [hidden email]
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> Hello,
>>>
>>> I like to discuss the following code snippet from function
>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>
>>>     /*
>>>      * We have to expand bread here since it will deadlock looking
>>>      * up the block number for any blocks that are not in the cache.
>>>      */
>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>             brelse(bp);
>>>             return (error);
>>>     }
>>>
>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
>>> getblk() at bp->b_data is always overwritten with the data from the
>>> explicit I/O performed by readblock(). By the way the flag B_CACHE was
>>> always set.
>> Key observation here that B_CACHE is set! That means that the data
>> in the block is valid and does not need to be read again. So the fix
>> is this:
>>
>> Index: sys/ufs/ffs/ffs_snapshot.c
>> ===================================================================
>> --- sys/ufs/ffs/ffs_snapshot.c (revision 321045)
>> +++ sys/ufs/ffs/ffs_snapshot.c (working copy)
>> @@ -1394,7 +1394,7 @@
>>   */
>>   bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>   bp->b_blkno = fsbtodb(fs, blkno);
>> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>      (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>   brelse(bp);
>>   return (error);
>>

This patch was my first try, because the comment in the code snippet
points to B_CACHE. Unfortunately it did not work, because the getblk()
buffer does not always have the correct data; the following "rm
snapfile" always crashed immediately with

  panic: ffs_blkfree_cg: freeing free block.

Therefore I came up with the CLUSTEROK hack.

>> Absent gjournal the extra read was unnecessary but harmless. The
>> buffer exists with B_DELWRI so we will not read it, or it is locked
>> while being written to disk, so we will wait for the write to complete,
>> after which it is safe to read it. With gjournal there is a window of
>> vulnerability between when the buffer is marked B_DELWRI and when it
>> is actually written to disk. This is because gjournal will say the
>> write has completed when it has not yet actually done so. Thus when
>> we read it, we get incorrect results. But until gjournal actually
>> completes the write, it will keep the buffer around (marked B_CACHE),
>> so as long as we check for B_CACHE to avoid the read (as we should
>> anyway to avoid an unnecessary read) we should always get consistent
>> results.
> I tend to disagree with the fix. What if the cached block is discarded
> before we enter this code fragment?
>
> I believe the bug is indeed clear after your analysis, but it is in
> gjournal.  Problem is that the disk (gjournal substrate layer, but for UFS
> it is all disk) returns data different from what was written to the
> volume.  After the write bio signalled completion, unless other write
> was done after it, reads _must_ return the content of the write.
>
> The right fix for gjournal is to perform 'store forwarding', i.e.
> to return data from the journal until the journal is flushed to disk.

That makes sense to me and explains why the problem never occurs after a
clean mount; it needs the preceding "rm snapfile" to trigger the
gjournal bug.

To be complete:
1. I run gjournal above gmirror.
2. I use the patch given in kern/198500.
3. In my loader.conf I have
   # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
   kern.geom.journal.cache.divisor="4"
It is a little bit annoying, but setting kern.geom.journal.cache.limit
in loader.conf does not work because it is overwritten in gjournal_init().

--
Dr. Andreas Longwitz


Re: ufs snapshot is sometimes corrupt on gjourneled partition

Kirk McKusick-2
> Date: Mon, 17 Jul 2017 10:14:57 +0200
> From: Andreas Longwitz <[hidden email]>
> To: Konstantin Belousov <[hidden email]>
> CC: Kirk McKusick <[hidden email]>, [hidden email]
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
>> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>>> I am copying Kostik on this email even though he is on freebsd-fs
>>> as I would like him to review my answer below.
>>>
>>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>>> From: Andreas Longwitz <[hidden email]>
>>>> To: Kirk McKusick <[hidden email]>
>>>> CC: [hidden email]
>>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>>
>>>> Hello,
>>>>
>>>> I like to discuss the following code snippet from function
>>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>>
>>>>     /*
>>>>      * We have to expand bread here since it will deadlock looking
>>>>      * up the block number for any blocks that are not in the cache.
>>>>      */
>>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>>>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>>             brelse(bp);
>>>>             return (error);
>>>>     }
>>>>
>>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
>>>> getblk() at bp->b_data is always overwritten with the data from the
>>>> explicit I/O performed by readblock(). By the way the flag B_CACHE was
>>>> always set.
>>>
>>> Key observation here that B_CACHE is set! That means that the data
>>> in the block is valid and does not need to be read again. So the fix
>>> is this:
>>>
>>> Index: sys/ufs/ffs/ffs_snapshot.c
>>> ===================================================================
>>> --- sys/ufs/ffs/ffs_snapshot.c (revision 321045)
>>> +++ sys/ufs/ffs/ffs_snapshot.c (working copy)
>>> @@ -1394,7 +1394,7 @@
>>>   */
>>>   bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>   bp->b_blkno = fsbtodb(fs, blkno);
>>> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>>      (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>   brelse(bp);
>>>   return (error);
>>>
>
> This patch was my first try, because the comment in the code snippet
> points to B_CACHE. Unfortunately it did not work, because the getblk()
> buffer does not always have the correct data, the following "rm
> snapfile" crashed always immediately with
>
>   panic: ffs_blkfree_cg: freeing free block.
>
> Therefore I came up with the CLUSTEROK hack.

Kostik's comment below that this is not a complete solution is
correct. Once gjournal has put the block in its log it releases
the buffer and it may fall out of the cache even though the log
has not been written to backing store. So, it may still be necessary
to get the contents filled. But, if B_CACHE is set, then the contents
should be valid as that is exactly what bread uses to decide if the
buffer needs to be read (see breadn_flags() in /sys/kern/vfs_bio.c).
So, I really do not understand how you can end up with a buffer with
invalid contents if B_CACHE is set because we are open coding bread()
here and that is the criterion it uses to read. If however, you do
the read when B_CACHE is not set, then as noted you will get the
wrong results and crash. But see also my comments below when B_CACHE
is not set.

>>> Absent gjournal the extra read was unnecessary but harmless. The
>>> buffer exists with B_DELWRI so we will not read it, or it is locked
>>> while being written to disk, so we will wait for the write to complete,
>>> after which it is safe to read it. With gjournal there is a window of
>>> vulnerability between when the buffer is marked B_DELWRI and when it
>>> is actually written to disk. This is because gjournal will say the
>>> write has completed when it has not yet actually done so. Thus when
>>> we read it, we get incorrect results. But until gjournal actually
>>> completes the write, it will keep the buffer around (marked B_CACHE),
>>> so as long as we check for B_CACHE to avoid the read (as we should
>>> anyway to avoid an unnecessary read) we should always get consistent
>>> results.
>> I tend to disagree with the fix. What if the cached block is discarded
>> before we enter this code fragment?
>>
>> I believe the bug is indeed clear after your analysis, but it is in
>> gjournal.  Problem is that the disk (gjournal substrate layer, but for UFS
>> it is all disk) returns data different from what was written to the
>> volume.  After the write bio signalled completion, unless other write
>> was done after it, reads _must_ return the content of the write.
>>
>> The right fix for gjournal is to perform 'store forwarding', i.e.
>> to return data from the journal until the journal is flushed to disk.
>
> That makes sense to me and explains why the problem never occurs after a
> clean mount, it needs the foregoing "rm snapfile" to trigger the
> gjournal bug.
>
> To be complete:
> 1. I run gjournal above gmirror.
> 2. I use the patch given in kern/198500.
> 3. In my loader.conf I have
>    # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
>    kern.geom.journal.cache.divisor="4"
> It is a little bit annoying, but setting kern.geom.journal.cache.limit
> in loader.conf does not work because its overwritten in gjournal_init().
>
> --
> Dr. Andreas Longwitz

The sequence of calls when using bread is:

Function           Line  File
--------           ----  ----
bread               491  sys/buf.h
breadn_flags       1814  kern/vfs_bio.c
bstrategy           397  sys/buf.h
BO_STRATEGY          86  sys/bufobj.h
bufstrategy        4535  kern/vfs_bio.c
ufs_strategy       2290  ufs/ufs/ufs_vnops.c
  (BO_STRATEGY on the filesystem device -> ffs_geom_strategy)
ffs_geom_strategy  2141  ufs/ffs/ffs_vfsops.c
g_vfs_strategy      161  geom/geom_vfs.c
g_io_request        470  geom/geom_io.c

Whereas readblock skips all these steps and calls g_io_request
directly. From my reading of the gjournal code, I believe that we
will still enter it with the g_io_request() call, as I believe that
it does not hook itself into any of the VOP_ call structure, but I
have not done a backtrace to confirm this. Assuming that we
are still getting into g_journal_start(), then it should be possible
to catch reads that are only in the log and pull out the data as
needed.

Another alternative is to send gjournal a request to flush its log
before starting the removal of a snapshot.

        Kirk McKusick
_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "[hidden email]"
Re: ufs snapshot is sometimes corrupt on gjourneled partition

Konstantin Belousov
On Mon, Jul 17, 2017 at 05:44:20PM -0700, Kirk McKusick wrote:

> The sequence of calls when using bread is:
>
> Function           Line  File
> --------           ----  ----
> bread               491  sys/buf.h
> breadn_flags       1814  kern/vfs_bio.c
> bstrategy           397  sys/buf.h
> BO_STRATEGY          86  sys/bufobj.h
> bufstrategy        4535  kern/vfs_bio.c
> ufs_strategy       2290  ufs/ufs/ufs_vnops.c
>   (BO_STRATEGY on the filesystem device -> ffs_geom_strategy)
> ffs_geom_strategy  2141  ufs/ffs/ffs_vfsops.c
> g_vfs_strategy      161  geom/geom_vfs.c
> g_io_request        470  geom/geom_io.c
>
> Whereas readblock skips all these steps and calls g_io_request()
> directly. From my reading of the gjournal code, I believe that we
> will still enter it via the g_io_request() call, since gjournal does
> not hook itself into any of the VOP_ call structure, but I have not
> done a backtrace to confirm this. Assuming that we are still getting
> into g_journal_start(), it should be possible to catch reads whose
> data is only in the log and pull out the data as needed.
>
> Another alternative is to send gjournal a request to flush its log
> before starting the removal of a snapshot.
I do not think that the UFS call sequence is relevant there.  It is
clearly a malfunction of the underlying I/O device (gjournal) if it
returns a data block different from the latest successfully written
block.  As is, whether UFS passes the read request from the buffer
cache through the BO_STRATEGY layers, or directly creates a bio and
reads the block, is not important.

OTOH, I do not think the issue is that gjournal always reads from the
data area and misses the journal.  The failure would be much more
spectacular in that case.  I see some gjournal code which tries to find
the data in a 'cache' on read, whatever that means.  It is clear that
sometimes it does not find the data.  The failure is probably
additionally hidden by the buffer cache eliminating most reads of
recently written data.

So the way to fix the bug is to read the gjournal code and understand
why it sometimes returns wrong data.  For instance, there were
relatively recent changes to the geom infrastructure allowing for
direct completion of bios.  Anyway, I have no interest in gjournal.
