Writing contiguously to UFS2?

Writing contiguously to UFS2?

Fluffles
Hello list,

I've set up a concat of 8 disks for my new NAS, using ataidle to spin down
the disks that aren't needed. This lets me save power, noise, and heat by
running only the drives that are actually in use.

My problem is UFS. UFS2 seems to write to 4 disks, even though all the
data written so far could easily fit on just one disk. What's going on
here? I looked at the newfs parameters, but in the past I was unable to
make newfs write contiguously; it seems UFS2 always moves on to a new
cylinder group. Is there any way to force UFS to write contiguously, or at
least limit the problem?

If I write 400GB to a 4TB volume consisting of 8x 500GB disks, I want all
the data to be on the first disk. If the data spreads out, more disks have
to be woken up when I read it back, which defeats the purpose of my
power-saving NAS experiment.

Any feedback is welcome. I'm using FreeBSD 6.2-RELEASE i386; the filesystem
was created with newfs -U -S 2048 <device>.
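For illustration, a rough sketch of this kind of setup -- the device names
and the idle timeout are made up, and the exact ataidle syntax varies
between versions (older ones take channel/unit numbers instead of a device
path):

# kldload geom_concat
# gconcat label -v nas ad4 ad6 ad8 ad10 ad12 ad14 ad16 ad18
# newfs -U -S 2048 /dev/concat/nas
# ataidle -S 10 /dev/ad4          (repeat for each member disk)

gconcat glues the disks end to end and exposes the result as
/dev/concat/nas; ataidle -S sets the standby (spin-down) timeout in
minutes.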

- Veronica

Re: Writing contiguously to UFS2?

Ivan Voras
Fluffles wrote:
> Hello list,
>
> I've set up a concat of 8 disks for my new NAS, using ataidle to spin down
> the disks that aren't needed. This lets me save power and noise/heat by
> running only the drives that are actually in use.
>
> My problem is UFS. UFS2 seems to write to 4 disks, even though all the

These 4 drives are used in what RAID form? If it's RAID0/stripe, you
can't avoid data being spread across the drives (since that is the whole
point of RAID0).

> data written so far can easily fit on just one disk. What's going on
> here? I looked at newfs parameters, but in the past was unable to make
> newfs write contiguously. It seems UFS2 always writes to a new cylinder group.
> Is there any way to force UFS to write contiguously? Or at least limit
> the problem?

If the drives are simply concatenated, then there might be weird
behaviour in choosing what cylinder groups to allocate for files. UFS
forces big files to be spread across cylinder groups so that no large
file fills entire cgs.



Re: Writing contiguously to UFS2?

Fluffles
In reply to this post by Fluffles

Ivan Voras wrote:
> These 4 drives are used in what RAID form? If it's RAID0/stripe, you
> can't avoid data being spread across the drives (since that is the whole
> point of RAID0).

It's an array of 8 drives in gconcat, so they use the JBOD / spanning /
concatenation scheme, which has no RAID designation but is simply a bunch
of disks glued end to end. There is no striping involved: offsets 0 through
500GB 'land' on disk0, then disk1 takes over, and so on:

offset 0                                                        offset 4TB
disk0 -> disk1 -> disk2 -> disk3 -> disk4 -> disk5 -> disk6 -> disk7

(for everyone not familiar with concatenation)


> If the drives are simply concatenated, then there might be weird
> behavior in choosing what cylinder groups to allocate for files. UFS
> forces big files to be spread across cylinder groups so that no large
> file fills entire cgs.

Exactly! And this is my problem. I do not like this behavior, for several
reasons:
- it lowers sequential transfer speed, because the disks have to seek
regularly
- UFS issues 2 reads per second while writing sequentially, probably some
metadata thing, but I don't like that either
- files are not written contiguously, which causes fragmentation;
essentially UFS forces big files to become fragmented this way.

Even worse: data is being stored at weird locations, which cripples my
energy-efficient NAS project. Even with only the first 400GB of data, it
is being stored across the first 4 disks of my concat configuration, so
when I open a folder I have to wait 10 seconds for a disk to spin up. In
regular operation multiple disks have to be spun up, which is impractical
and unnecessary. Is there any way to force UFS to write contiguously?
Otherwise I think I should try Linux with some Linux filesystem (XFS,
ReiserFS, JFS) in the hope that they do not suffer from this problem.

In the past, when testing geom_raid5, I tried to tune newfs parameters so
that it would write contiguously, but there were still regular 2-phase
writes, which means data was not written contiguously. I really dislike
this behavior.

- Veronica

Re: Writing contiguously to UFS2?

Ivan Voras
Fluffles wrote:

> Even worse: data is being stored at weird locations, so that my energy
> efficient NAS project becomes crippled. Even with the first 400GB of
> data, it's storing that on the first 4 disks in my concat configuration,

> In the past when testing geom_raid5 I've tried to tune newfs parameters
> so that it would write contiguously but still there were regular 2-phase
> writes which mean data was not written contiguously. I really dislike
> this behavior.

I agree, this is my least favorite aspect of UFS (maybe together with its
lack of extents), for various reasons. I feel it's time to start lobbying
heavily for finishing FreeBSD's implementations of XFS and reiserfs :)

(ZFS is not the ultimate solution: 1) replacing a UFS monoculture with a
ZFS monoculture will sooner or later cause problems, and 2) sometimes a
"dumb" Unix filesystem is preferable to the "smart" ZFS.)


Re: Writing contiguously to UFS2?

Eric Anderson-13
In reply to this post by Fluffles
Fluffles wrote:

> [...]
> Exactly! And this is my problem. I do not like this behavior for various
> reasons: [...] files are not written contiguously which causes
> fragmentation, essentially UFS forces big files to become fragmented this
> way.
>
> Even worse: data is being stored at weird locations, so that my energy
> efficient NAS project becomes crippled. [...] Is there any way to force
> UFS to write contiguously?
>
> In the past when testing geom_raid5 I've tried to tune newfs parameters
> so that it would write contiguously but still there were regular 2-phase
> writes which mean data was not written contiguously. I really dislike
> this behavior.

This notion of breaking up large blocks of data into smaller chunks is
fundamental to the UFS (well, FFS) filesystem, and has been around for
ages.  I'm not saying it's the One True FS Format by any means, but many,
many other file systems use the same principles.

The largest file size per chunk in a cylinder group is calculated at
newfs time, which also determines how many cylinder groups there should
be.  I think the largest size I've seen was something in the 460MB-ish
range, meaning any contiguous write above that would span more than one
cylinder group.
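To check what newfs actually chose for an existing filesystem, dumpfs
prints the relevant superblock fields (ncg, bpg/fpg, maxbpg and so on); the
device name here is only an example:

# dumpfs /dev/ad4s1f | head -n 20
# dumpfs -m /dev/ad4s1f           (where supported, prints an equivalent
                                   newfs command line)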

The capped cylinder group size also has another bad side effect: the more
cylinder groups you have, the longer it takes to create a snapshot.

I recommend trying msdosfs.  On recent -CURRENT it should perform fairly
well (akin to UFS2, I think), and if I recall correctly it has a more
contiguous block layout.

In the end, extending UFS2 to support much larger cylinder group sizes
would be hugely beneficial.  Instead of forcing XFS, reiserfs, JFS,
ext[23], etc. (most of which are GPL'ed) to be writable, why not start
the (immensely huge) task of a UFS3, with support for all the things we
need for the next 5-10 years?  UFS2 has served well from 5.x to 7.x, but
what about the future?

Making a UFS3 takes time and dedication from developers.

Eric


Re: Writing contiguously to UFS2?

Stefan Esser-3
In reply to this post by Ivan Voras
Ivan Voras wrote:

> Fluffles wrote:
>
>> Even worse: data is being stored at weird locations, so that my energy
>> efficient NAS project becomes crippled. Even with the first 400GB of
>> data, it's storing that on the first 4 disks in my concat configuration,
>
>> In the past when testing geom_raid5 I've tried to tune newfs
>> parameters so that it would write contiguously but still there were
>> regular 2-phase writes which mean data was not written contiguously. I
>> really dislike this behavior.
>
> I agree, this is my least favorite aspect of UFS (maybe together with
> nonimplementation of extents), for various reasons. I feel it's time to
> start heavy lobbying for finishing FreeBSD's implementations of XFS and
> reiserfs :)
>
> (ZFS is not the ultimate solution: 1) replacing UFS monoculture with ZFS
> monoculture will sooner or later yield problems, and 2) sometimes a
> "dumb" unix filesystem is preferred to the "smart" ZFS).

Both XFS and ReiserFS are quite complex compared to UFS, and definitely
not well described by the term "dumb" ;-)

The FFS paper by McKusick et al. describes the historical allocation
strategy, which was somewhat modified in FreeBSD a few years ago in
order to adapt to modern disk sizes (with larger cylinder groups it is
not a good idea to create each new directory in a new cylinder group).

The code that implements the block layout strategy is easily found
in the sources and can be modified without too much risk to your
file system's consistency ...

Regards, STefan

Re: Writing contiguously to UFS2?

Ivan Voras
Stefan Esser wrote:
> Ivan Voras wrote:

>> (ZFS is not the ultimate solution: 1) replacing UFS monoculture with ZFS
>> monoculture will sooner or later yield problems, and 2) sometimes a
>> "dumb" unix filesystem is preferred to the "smart" ZFS).
>
> Both XFS and ReiserFS are quite complex compared to UFS definitely
> not well described by the term "dumb" ;-)

Of course, I mean no disrespect to them, I've read enough papers on them
to realize their complexity :) By "dumb" I meant they behave like "point
them to a device and they will stick to it", i.e. they don't come with a
volume manager.

> The FFS paper by McKusick et.al describes the historical allocation
> strategy, which was somewhat modified in FreeBSD a few years ago in
> order to adapt to modern disk sizes (larger cylinder groups, meaning
> it is not a good idea to create each new directory in a new cylinder
> group).

[thinking out loud:]

From experience (not from reading the code or the docs) I conclude that
cylinder groups cannot be larger than around 190 MB. I know this from
numerous runs of newfs and from the development of gvirstor, which
interacts with cgs in an "interesting" way. I know the reasons why cgs
exist (mainly to lower latencies from seeking), but with today's drives
and memory configurations it would sometimes be nice to make them larger
or, in the extreme, to have just one cg covering the entire drive. In a
concat configuration, though, that extreme would put all of the block and
inode metadata on the first drive, which could have interesting effects
on performance. Of course, with seek-less drives (solid state) there's no
reason to have cgs at all.




Re: Writing contiguously to UFS2?

Ivan Voras
In reply to this post by Eric Anderson-13
Eric Anderson wrote:

> The largest file size per chunk in a cylinder group is calculated at
> newfs time, which determines also how many cylinder groups there should
> be.  I think the largest size I've seen was something in the 460MB-ish
> range, meaning any contiguous write above that would span more than one
> cylinder group.

Hmm, how did you manage to create a file system with such large cylinder
groups? I've experimented with file systems of a few TB and still
couldn't get cylinder groups larger than around 190 MB (though I wasn't
actively trying, I just observed how they turned out).




Re: Writing contiguously to UFS2?

Gary Palmer-2
On Fri, Sep 21, 2007 at 02:50:14PM +0200, Ivan Voras wrote:

> Eric Anderson wrote:
>
> >The largest file size per chunk in a cylinder group is calculated at
> >newfs time, which determines also how many cylinder groups there should
> >be.  I think the largest size I've seen was something in the 460MB-ish
> >range, meaning any contiguous write above that would span more than one
> >cylinder group.
>
> Hmm, how did you manage to create a file system with such large cylinder
> groups? I've experimented with smallnum-TB file systems and still
> couldn't make them larger than around 190 MB (though I wasn't actively
> trying, just observed how they turned out).

Presumably by using the -c parameter to newfs.

The original poster might get some traction out of a combination of
-c and -e parameters to newfs, although the fundamental behaviour
will remain unchanged.
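For example -- the numbers are purely illustrative, and -e takes its value
in filesystem blocks:

# newfs -U -b 16384 -f 2048 -e 8192 /dev/concat/nas
# tunefs -e 8192 /dev/concat/nas       (the same knob on an existing,
                                        unmounted filesystem)

-e raises maxbpg, the number of blocks a single file may allocate in one
cylinder group before the allocator prefers to move on to the next group.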

Re: Writing contiguously to UFS2?

Ivan Voras
Gary Palmer wrote:

> Presumably by using the -c parameter to newfs.

Hm, I'll try it again later but I think I concluded that -c can be used
to lower the size of cgs, not to increase it.


Re: Writing contiguously to UFS2?

Gary Palmer-2
On Fri, Sep 21, 2007 at 03:23:20PM +0200, Ivan Voras wrote:
> Gary Palmer wrote:
>
> >Presumably by using the -c parameter to newfs.
>
> Hm, I'll try it again later but I think I concluded that -c can be used
> to lower the size of cgs, not to increase it.

A CG is basically an inode table with a block allocation bitmap to keep
track of which disk blocks are in use.  You might have to use the -i
parameter to increase the expected average file size; that should allow
you to increase the CG size.  It's been a LONG time since I looked at the
UFS code, but I suspect the number of inodes per CG is capped.
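A cheap way to experiment is newfs's -N flag, which prints the computed
parameters without writing anything; the density value below is just an
example:

# newfs -N -i 1048576 /dev/concat/nas

The output shows the resulting cylinder group size and inodes per group,
so -i (and -b/-f) can be iterated on before the filesystem is actually
created.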

Re: Writing contiguously to UFS2?

Bruce Evans-4
In reply to this post by Ivan Voras
On Fri, 21 Sep 2007, Ivan Voras wrote:

> Fluffles wrote:
>
>> Even worse: data is being stored at weird locations, so that my energy
>> efficient NAS project becomes crippled. Even with the first 400GB of data,
>> it's storing that on the first 4 disks in my concat configuration,
>
>> In the past when testing geom_raid5 I've tried to tune newfs parameters so
>> that it would write contiguously but still there were regular 2-phase
>> writes which mean data was not written contiguously. I really dislike this
>> behavior.
>
> I agree, this is my least favorite aspect of UFS (maybe together with
> nonimplementation of extents), for various reasons. I feel it's time to start
> heavy lobbying for finishing FreeBSD's implementations of XFS and reiserfs :)

Why not improve the implementation of ffs?  Cylinder groups are
fundamental to ffs, and I think having too many too-small ones is
fairly fundamental, but the allocation policy across them can be
anything, and large cylinder groups could be faked using small ones.

The current (and very old?) allocation policy for extending files is
to consider allocating the block in a new cg when the current cg has
more than fs_maxbpg blocks allocated in it (newfs and tunefs parameter
-e maxbpg: default 1/4 of the number of blocks in a cg = bpg).  Then
preference is given to the next cg with more than the average number
of free blocks.  This seems to be buggy.  From ffs_blkpref_ufs1():

%         if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) {
%                 if (lbn < NDADDR + NINDIR(fs)) {
%                         cg = ino_to_cg(fs, ip->i_number);
%                         return (cgbase(fs, cg) + fs->fs_frag);
%                 }

I think "indx" here is the index into an array of block pointers in
the inode or an indirect block.  So for extending large files it is
always into an indirect block.  It gets reset to 0 for each new indirect
block.  This makes its use in (indx % fs->fs_maxbpg == 0) dubious.
The condition is satisfied whenever:
- indx == 0, i.e., always at the start of a new indirect block.  Not
   too bad, but not what we want if fs_maxbpg is much larger than the
   number of indexes in an indirect block.
- indx == a nonzero multiple of the number of indexes in an indirect block.
   This condition is never satisfied if fs_maxbpg is larger than
   the number of indexes in an indirect block.  This is the usual case
   for ffs2 (only 2K indexes in 16K-blocks, and fairly large cg's).  On
   an ffs1 fs that I have handy, maxbpg is 2K and the number of indexes
   is 4K, so this condition is satisfied once.

The (bap[indx - 1] == 0) condition causes a move to a new cg after
every hole.  This may help by leaving space to fill in the hole, but
it is wrong if the hole will never be filled in or is small.  This
seems to be just a vestige of code that implemented the old rotdelay
pessimization.  Comments saying that we use fs_maxcontig near here
are obviously vestiges of the pessimization.

%                 /*
%                  * Find a cylinder with greater than average number of
%                  * unused data blocks.
%                  */
%                 if (indx == 0 || bap[indx - 1] == 0)
%                         startcg =
%                             ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg;

At the start of an indirect block, and after a hole, we don't know where
the previous block was so we use the cg of the inode advanced by the
estimate (lbn / fs->fs_maxbpg) of how far we have advanced from the cg
of the inode.  I think this estimate is too primitive to work right even
a small fraction of the time.  Adjustment factors related to the number
of maxbpg's per block of indexes and the fullness of the disk seem to
be required.  Keeping track of the cg of the previous block would be better.

%                 else
%                         startcg = dtog(fs, bap[indx - 1]) + 1;

Now there is no problem finding the cg of the previous block.  Note that
we always add 1...

%                 startcg %= fs->fs_ncg;
%                 avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
%                 for (cg = startcg; cg < fs->fs_ncg; cg++)

... so the search gives maximal non-preference to the cg of the previous
block.  I think things would work much better if we considered the
current cg, if any, first (current cg = one containing previous block), and
we actually know that cg.  This would be easy to try -- just don't add 1.
Also try not adding the bad estimate (lbn / fs->fs_maxbpg), so that the
search starts at the inode's cg in some cases -- then previous cg's will
be reconsidered but hopefully the average limit will prevent them being
used.

Note that in the calculation of avgbfree, division by ncg gives a
granularity of ncg, so there is an inertia of ncg blocks against moving
to the next cg.  A too-large ncg is a feature here.

BTW, I recently found the bug that broke the allocation policy in FreeBSD's
implementation of ext2fs.  I thought that the bug was missing code/a too
simple implementation (one without a search like the above), but it turned
out to be just a bug.  The search wasn't set up right, so the current cg
was always preferred.  Always preferring the current cg tends to give
contiguous allocation of data blocks, and this works very well for small
file systems, but for large file systems the data blocks end up too far
away from the inodes (since there is a limited number of inodes per cg and
the per-cg inode and data block allocations fill up at different rates).

Bruce

Re: Writing contiguously to UFS2?

Bruce Evans-4
In reply to this post by Eric Anderson-13
On Fri, 21 Sep 2007, Eric Anderson wrote:

> I recommend trying msdos fs.  On recent -CURRENT, it should perform fairly
> well (akin to UFS2 I think), and if I recall correctly, has a more contiguous
> block layout.

It can give perfect contiguity for data blocks, but it is seriously slow
for non-sequential access to large files, and anyway "large" for msdosfs
is only 4GB.

Bruce

Re: Writing contiguously to UFS2?

Eric Anderson-13
Bruce Evans wrote:

> On Fri, 21 Sep 2007, Eric Anderson wrote:
>
>> I recommend trying msdos fs.  On recent -CURRENT, it should perform
>> fairly well (akin to UFS2 I think), and if I recall correctly, has a
>> more contiguous block layout.
>
> It can give perfect contiguity for data blocks, but has serious slowness
> for non-sequential access to large files, and anyway "large" for msdosfs
> is only 4GB.


Oops - I forgot about the 4GB limit.  I was also assuming that random
reads within a big file weren't an issue given the configuration described
by the original poster... but maybe that's a bad assumption.

Eric

Re: Writing contiguously to UFS2?

Oliver Fromme
In reply to this post by Fluffles
Fluffles <[hidden email]> wrote:
> I've set up a concat of 8 disks for my new NAS, using ataidle to spin down
> the disks that aren't needed. This lets me save power and noise/heat by
> running only the drives that are actually in use.
>
> My problem is UFS. UFS2 seems to write to 4 disks, even though all the
> data written so far can easily fit on just one disk. What's going on
> here? I looked at newfs parameters, but in the past was unable to make
> newfs write contiguously. It seems UFS2 always moves on to a new cylinder
> group. Is there any way to force UFS to write contiguously? Or at least
> limit the problem?
>
> If I write 400GB to a 4TB volume consisting of 8x 500GB disks, I want
> all data to be on the first disk.

You should be able to achieve that by putting a gvirstor on top of your
drives, with a virtual size equal to the combined physical size of those
eight drives, and then running newfs on the gvirstor device.

I haven't used gvirstor myself, but if I understand it correctly, it
starts filling its providers from the beginning and only begins using
the next one when the previous ones are completely full.  So it should
do exactly what you want.

http://wiki.freebsd.org/gvirstor
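A sketch of what that might look like -- untested, with placeholder names
and size (see gvirstor(8) for the exact option syntax):

# kldload geom_virstor
# gvirstor label -v -s <combined size of the 8 disks> nas \
      ad4 ad6 ad8 ad10 ad12 ad14 ad16 ad18
# newfs -U /dev/virstor/nas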

Best regards
   Oliver

Re: Writing contiguously to UFS2?

Bruce Evans-4
In reply to this post by Gary Palmer-2
On Fri, 21 Sep 2007, Gary Palmer wrote:

> On Fri, Sep 21, 2007 at 03:23:20PM +0200, Ivan Voras wrote:
>> Gary Palmer wrote:
>>
>>> Presumably by using the -c parameter to newfs.
>>
>> Hm, I'll try it again later but I think I concluded that -c can be used
>> to lower the size of cgs, not to increase it.

Yes, it used to default to a small value, but that became very pessimal
when disks became larger than a whole 1GB or so, so obrien changed it
to default to the maximum possible value.  I think it hasn't been
changed back down.

> A CG is basically an inode table with a block allocation bitmap to keep
> track of what disk blocks are in use.  You might have to use the -i
> parameter to increase the expected average file size.  That should
> allow you to increase the CG size.  Its been a LONG time since I looked
> at the UFS code, but I suspect the # of inodes per CG is probably capped.

The limit seems to be only that struct cg (mainly the struct hack stuff
at the end) fits in a single block.  The non-struct parts of this
struct consist mainly of the inode, block and cluster bitmaps.  The
block bitmap is normally the largest by far, since it actually maps
fragments.  With 16K-blocks and 2K-frags, at most 128K frags = 256MB
of disk can be mapped.  I get 180MB in practice, with an inode bitmap
size of only 3K, so there is not much to be gained by tuning -i but
more to be gained by tuning -b and -f (several doublings are reasonable).
However, I think small cg's are not a problem for huge files, except
for bugs.
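Spelling the arithmetic out: a 16K block holds 16384 * 8 = 131072 bits of
fragment bitmap, so it can map at most 128K fragments, and 128K * 2K-frags
= 256MB.  The ~180MB seen in practice is what remains after the rest of
struct cg (inode and cluster bitmaps, summary fields) takes its share of
the same block.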

Bruce

Re: Writing contiguously to UFS2?

Ivan Voras
In reply to this post by Oliver Fromme
Oliver Fromme wrote:

> I haven't used gvirstor myself, but if I understand it
> correctly, it should start filling its providers from the
> start, and only begin using the next one when the previous
> ones are all completely used.  So it should do exactly
> what you want.

Yes, with the side effect of putting all the cgs and their metadata at the
beginning of the first drive.  An obvious consequence is that writing to
any drive other than the first will also touch the first drive.


Re: Writing contiguously to UFS2?

Bruce Evans-4
In reply to this post by Bruce Evans-4
On Sat, 22 Sep 2007, Bruce Evans wrote:

> The current (and very old?) allocation policy for extending files is
> to consider allocating the block in a new cg when the current cg has
> more than fs_maxbpg blocks allocated in it (newfs and tunefs parameter
> -e maxbpg: default 1/4 of the number of blocks in a cg = bpg).  Then
> preference is given to the next cg with more than the average number
> of free blocks.  This seems to be buggy.  From ffs_blkpref_ufs1():

Actually, it is almost as good as possible.

Note that ffs_blkpref_*() only gives a preference, so it shouldn't try
too hard.  Also, it is only used if block reallocation is enabled (the
default: sysctl vfs.ffs.doreallocblks=1).  Then it gives delayed
allocation.  Block reallocation generally does a good job of reallocating
blocks contiguously.  I don't know exactly where/how the allocation
is done if block reallocation is not enabled, but certainly, maxbpg
is not used then.

> % if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) {
> % if (lbn < NDADDR + NINDIR(fs)) {
> % cg = ino_to_cg(fs, ip->i_number);
> % return (cgbase(fs, cg) + fs->fs_frag);
> % }
>
> I think "indx" here is the index into an array of block pointers in
> the inode or an indirect block.

This is correct.

> So for extending large files it is
> always into an indirect block.  It gets reset to 0 for each new indirect
> block.  This makes its use in (indx % fs->fs_maxbpg == 0) dubious.
> The condition is satisfied whenever:
> - indx == 0, i.e., always at the start of a new indirect block.  Not
>  too bad, but not what we want if fs_maxbpg is much larger than the
>  number of indexes in an indirect block.

Actually, this case is handled quite well later.

> - index == nonzero multiple of number of indexes in an indirect block.
>  This condition is never be satisfied if fs_maxbpg is larger than
>  the number of indexes in an indirect block.  This is the usual case
>  for ffs2 (only 2K indexes in 16K-blocks, and fairly large cg's).  On
>  an ffs1 fs that I have handy, maxbpg is 2K and the number of indexes
>  is 4K, so this condition is satisfied once.

This case is not handled well.  The bug for it is mainly in newfs.
From newfs.c:

% /*
%  * MAXBLKPG determines the maximum number of data blocks which are
%  * placed in a single cylinder group. The default is one indirect
%  * block worth of data blocks.
%  */
% #define MAXBLKPG(bsize) ((bsize) / sizeof(ufs2_daddr_t))

The comment is correct, but the code does not match it for ffs1: there
MAXBLKPG works out to only half an indirect block's worth of data blocks.
I just use the default, so my ffs1 fs has maxbpg 2K instead of 4K.

> The (bap[indx - 1] == 0) condition causes a move to a new cg after
> every hole.

Actually, this case is handled well later.

> This may help by leaving space to fill in the hole, but
> it is wrong if the hole will never be filled in or is small.  This
> seems to be just a vestige of code that implemented the old rotdelay
> pessimization.

Actually, it is still needed for using bap[indx - 1] at the end of the function.

> Comments saying that we use fs_maxcontig near here
> are obviously vestiges of the pessimization.
>
> %                 /*
> %                  * Find a cylinder with greater than average number of
> %                  * unused data blocks.
> %                  */
> %                 if (indx == 0 || bap[indx - 1] == 0)
> %                         startcg =
> %                             ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg;
>
> At the start of an indirect block, and after a hole, we don't know where
> the previous block was so we use the cg of the inode advanced by the
> estimate (lbn / fs->fs_maxbpg) of how far we have advanced from the cg
> of the inode.  I think this estimate is too primitive to work right even
> a small fraction of the time.  Adjustment factors related to the number
> of maxbpg's per block of indexes and the fullness of the disk seem to
> be required.  Keeping track of the cg of the previous block would be better.

Actually, this estimate works very well.  We _want_ to change to a new
cg after every maxbpg blocks.  The estimate gives the closest cg that
is possible if all the blocks are allocated as contiguously as we want.
If the disk is nearly full we will probably have to go further.  Starting
the search at the closest cg that we want gives a bias towards close
cg's that are not too close.

> % else
> % startcg = dtog(fs, bap[indx - 1]) + 1;
>
> Now there is no problem finding the cg of the previous block.  Note that
> we always add 1...
>
> % startcg %= fs->fs_ncg;
> % avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
> % for (cg = startcg; cg < fs->fs_ncg; cg++)
>
> ... so the search gives maximal non-preference to the cg of the previous
> block.  I think things would work much better if we considered the
> current cg, if any, first (current cg = one containing previous block), and
> we actually know that cg.  This would be easy to try -- just don't add 1.
> Also try not adding the bad estimate (lbn / fs->fs_maxbpg), so that the
> search starts at the inode's cg in some cases -- then previous cg's will
> be reconsidered but hopefully the average limit will prevent them being
> used.

Actually, adding 1 is correct in most cases.  Here we think we have just
allocated maxbpg blocks in the current cg, so we _want_ to advance to
the next cg.  The problem is that we don't really know that we have
allocated that many blocks.  We have lots of previous block numbers in
bap[] and could inspect many of them, but we only inspect the previous
one.  The corresponding code in 4.4BSD is better -- it inspects the
one some distance before the previous one.  The corresponding distance
here is maxbpg.  We could inspect the blocks at 1 previous and maxbpg
previous to quickly estimate if we have allocated all of the previous
maxbpg blocks in the same cylinder group.

> Note that in the calculation of avgbfree, division by ncg gives a
> granularity of ncg, so there is an inertia of ncg blocks against moving
> to the next cg.  A too-large ncg is a feature here.

This feature shouldn't make much difference, but we don't want it if we
are certain that we have just allocated maxbpg blocks in a cg.

Analysis of block layouts for a 200MB file shows no large problems in
this area, but some small ones.  This is with some problems already
fixed.  200MB is a bit small but gives data small enough to understand
easily.  The analysis is limited to ffs1 since I only have a layout-
printing program for that.  I don't use ffs2 and haven't fixed the
"some" problems for it.  Perhaps they are the ones that matter here.
(For what they are, see below.)

ffs1, no soft updates (all tests on an almost-new fs):

% fs_bsize = 16384
% fs_fsize = 2048
% 4: lbn 0-11 blkno 1520-1615
% lbn [<1>indir]12-4107 blkno 1616-1623
% lbn 12-4107 blkno 1624-34391

Everything is perfectly contiguous until here.  Without my fixes, the first
indirect block in the middle tends to be allocated discontiguously.  Here
lbn's have size fs_bsize = 16K, and blkno's have size fs_fsize = 2K; "4:"
is just the inode number; "[<n>indir]" is an nth indirect block.

% lbn [<2>indir]4108-16781323 blkno 189592-189599

Bug.  cg's have size about 94000 in blkno units.  We have skipped the
entire second cg.

% lbn [<1>indir]4108-8203 blkno 189600-189607
% lbn 4108-6155 blkno 189608-205991

All contiguous.

% lbn 6156-8203 blkno 283640-300023

This is from the newfs bug (default maxbpg = half an indirect block's
worth of blkno's).  Here we advance to the next cg half way through
the indirect block.  The advance is only about 90000 blkno's so it
correctly doesn't skip a cg.

% lbn [<1>indir]8204-12299 blkno 377688-377695
% lbn 8204-10251 blkno 377696-394079
% lbn 10252-12299 blkno 471736-488119
% lbn [<1>indir]12300-16395 blkno 565784-565791
% lbn 12300-12799 blkno 565792-569791

The pattern continues with no problems except the default maxbpg being
too small.  This does almost what the OP wants -- with a huge disk,
even huge files fit in a few cg's (lots of cg's but few compared with
the total number).  With tunefs -e <maxbpg=bpg>, I think the layout
would be perfectly contiguous except for the skip after the first cg.

My fix is only for the first indirect block, so it doesn't make much
difference for large files.  With the default maxbpg, later indirect
blocks are always allocated in a new cg anyway.  Hopefully the
"primitive" estimate prevents this so that all indirect blocks have
a chance of being allocated contiguously, and other code cooperates
by not moving them.

ffs1, soft updates:

% fs_bsize = 16384
% fs_fsize = 2048
% 5: lbn 0-11 blkno 34392-34487

For some reason, the file is started later in the first cg.

% lbn [<1>indir]12-4107 blkno 34488-34495
% lbn 12-4107 blkno 34496-67263

Contiguous.  Without my fix, soft updates seems to move the first indirect
block further away, and thus is noticeably slower.

% lbn [<2>indir]4108-16781323 blkno 285592-285599

Soft updates has skipped not just the second cg but the third one too.

% lbn [<1>indir]4108-8203 blkno 285600-285607
% lbn 4108-6155 blkno 285608-301991
% lbn 6156-8203 blkno 377688-394071
% lbn [<1>indir]8204-12299 blkno 471736-471743
% lbn 8204-10251 blkno 471744-488127
% lbn 10252-12299 blkno 565784-582167
% lbn [<1>indir]12300-16395 blkno 659832-659839
% lbn 12300-12799 blkno 659840-663839

The pattern continues (no more skips).

ffs1, no soft updates, maxbpg = 655360:

% fs_bsize = 16384
% fs_fsize = 2048
% 4: lbn 0-11 blkno 1520-1615
% lbn [<1>indir]12-4107 blkno 1616-1623
% lbn 12-4107 blkno 1624-34391
% lbn [<2>indir]4108-16781323 blkno 95544-95551
% lbn [<1>indir]4108-8203 blkno 95552-95559
% lbn 4108-8203 blkno 95560-128327
% lbn [<1>indir]8204-12299 blkno 189592-189599
% lbn 8204-12299 blkno 189600-222367
% lbn [<1>indir]12300-16395 blkno 283640-283647
% lbn 12300-12799 blkno 283648-287647

The "primitive" estimate isn't helping -- a new cg is started for every
indirect block.

Bruce

Re: Writing contiguously to UFS2?

Rick C. Petty-3
In reply to this post by Ivan Voras
On Fri, Sep 21, 2007 at 02:45:35PM +0200, Ivan Voras wrote:
> Stefan Esser wrote:
>
> From experience (not from reading code or the docs) I conclude that
> cylinder groups cannot be larger than around 190 MB. I know this from
> numerous runnings of newfs and during development of gvirstor which
> interacts with cg in an "interesting" way.

Then you didn't run newfs enough:

# newfs -N -i 12884901888 /dev/gvinum/mm-flac
density reduced from 2147483647 to 3680255
/mm/flac: 196608.0MB (402653184 sectors) block size 16384, fragment size 2048
        using 876 cylinder groups of 224.50MB, 14368 blks, 64 inodes.

When you specify the -i option, newfs minimizes the number of
inodes created.  If the density is high enough, it will use only one
block of inodes per CG (the minimum).  From there, the density is reduced
(as per the message above) and the CG size is increased until the frag
bitmap can fit into a single block.  With UFS2 and the default options of
-b 16384 -f 2048, this gives you 224.50 MB per CG.

If you wish to play around with the block/frag sizes, you can greatly
increase the CG size:

# newfs -N -f 8192 -b 65536 -i 12884901888 /dev/gvinum/mm-flac
density reduced from 2147483647 to 14868479
/mm/flac: 196608.0MB (402653184 sectors) block size 65536, fragment size 8192
        using 55 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.

Doing this is quite appropriate for large disks.  This last command means:
blocks are allocated in 64k chunks and the minimum allocation size is 8k.
Some may say this is wasteful, but one could also argue that using less
than 10% of your inodes is also wasteful.

> I know the reasons why cgs
> exist (mainly to lower latencies from seeking) but with todays drives

I don't believe that is true.  CGs exist to prevent complete data
loss if the front of the disk is trashed.  The blocks and inodes have close
proximity partly for lower latency but also to reduce the risk of corruption.
It is suggested that the CG offsets be staggered to make best use of
rotational delay, but this is obviously irrelevant with modern drives.

> and memory configurations it would sometimes be nice to make them larger
> or in the extreme, make just one cg that covers the entire drive.

And put it in the middle of the drive, not at the front.  Gee, this is what
NTFS does..  Hmm...

There are significant advantages to staggering the CGs across the device
(or, in the case of some GEOM classes, across providers).

Here is an interesting experiment to try:  write a new version of
/usr/src/sbin/newfs/mkfs.c that doesn't have the restriction that the free
fragment bitmap resides in one block.  I'm not 100% sure if the FFS code
would handle it properly, but in theory it should work (the offsets are
stored in the superblocks).  This is the biggest restriction on the CG
size.  You should be able to create 2-4 CGs to span each of your 1TB
drives without increasing the block size and thus minimum allocation unit.

-- Rick C. Petty

Re: Writing contiguously to UFS2?

Rick C. Petty-3
In reply to this post by Bruce Evans-4
On Sat, Sep 22, 2007 at 04:10:19AM +1000, Bruce Evans wrote:
>
> of disk can be mapped.  I get 180MB in practice, with an inode bitmap
> size of only 3K, so there is not much to be gained by tuning -i but

I disagree.  There is much to be gained by tuning -i: 224.50 MB per CG vs.
183.77 MB; that's a 22% difference.

However, the biggest gain from tuning -i is shedding the extra (unused)
inodes.  Care should be taken with the -i option -- running out of inodes
when you have gigs of free space can be very frustrating.  But I newfs
all my volumes with an approximate inode density based on
already-existing files plus a minor fudge factor.  The only time I ran out
of inodes with this method was due to a calculation error on my part.
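One way to pick a density is to measure a filesystem that already holds a
similar mix of files (the mount point is just the one from the earlier
example):

# df -ik /mm/flac

Divide the used kilobytes by the inodes in use (the iused column), convert
to bytes, then knock a bit off -- a lower density means more inodes, which
is the safe direction -- and pass the result to newfs -i.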

> more to be gained by tuning -b and -f (several doublings are reasonable).

I completely agree with this.  It's unfortunate that newfs doesn't scale
the defaults here based on the device size.  Before someone dives in and
commits any adjustments, I hope they do sufficient testing and post their
results on this mailing list.

-- Rick C. Petty