RAID-Z wasted space - asize roundups to nparity +1

Adam Nowacki-3
I've just found something very weird in the ZFS code.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:504 in HEAD

Can someone explain the reason behind this line of code? What it does is
align the on-disk record size to a multiple of the number of parity disks
+ 1 ... this really doesn't make any sense. As far as I can tell those
extra sectors are just padding - completely unused.

For the array I'm using, this results in 4.8% of wasted disk space -
1.7TB. It's a 12x 3TB disk RAID-Z2.

Re: RAID-Z wasted space - asize roundups to nparity +1

Matthew Ahrens
This is so that we won't end up with small, unallocatable segments.  E.g.
if you are using RAIDZ2, the smallest usable segment would be 3 sectors (1
sector data + 2 sectors parity).  If we left a 1 or 2 sector free segment,
it would be unusable and you'd be able to get into strange accounting
situations where you have free space but can't write because you're "out of
space".

The amount of waste due to this can be minimized by using larger blocksizes
(e.g. the default recordsize of 128k and files larger than 128k), and by
using smaller sector sizes (e.g. 512b sector disks rather than 4k sector
disks).  In your case these techniques would limit the waste to 0.6%.
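
A minimal, self-contained sketch of the allocation arithmetic (modeled on
vdev_raidz_asize(); the helper names and the standalone main() are
illustrative, not the kernel code):

#include <stdio.h>
#include <stdint.h>

/* Round x up to the next multiple of m. */
static uint64_t
roundup64(uint64_t x, uint64_t m)
{
    return (((x + m - 1) / m) * m);
}

/*
 * Sectors allocated for one RAID-Z block: data sectors, plus nparity
 * parity sectors per stripe of (cols - nparity) data sectors, with the
 * sum rounded up to a multiple of nparity + 1 -- the roundup at issue.
 */
static uint64_t
raidz_alloc_sectors(uint64_t psize, uint64_t sectsize, uint64_t cols,
    uint64_t nparity)
{
    uint64_t dcols = cols - nparity;
    uint64_t data = (psize + sectsize - 1) / sectsize;
    uint64_t parity = nparity * ((data + dcols - 1) / dcols);

    return (roundup64(data + parity, nparity + 1));
}

static void
report(uint64_t psize, uint64_t sectsize, uint64_t cols, uint64_t nparity)
{
    uint64_t data = (psize + sectsize - 1) / sectsize;
    uint64_t alloc = raidz_alloc_sectors(psize, sectsize, cols, nparity);
    /* Ideal allocation: data plus the exact parity share. */
    double ideal = (double)data * cols / (cols - nparity);

    printf("%llu KiB records, %llu-byte sectors: %llu sectors "
        "(%.1f%% waste vs. ideal)\n",
        (unsigned long long)(psize >> 10), (unsigned long long)sectsize,
        (unsigned long long)alloc,
        100.0 * ((double)alloc - ideal) / ideal);
}

int
main(void)
{
    report(128 << 10, 512, 12, 2);   /* ~0.6%, the figure above */
    report(128 << 10, 4096, 12, 2);  /* ~9.4%, cf. Adam's reply below */
    report(1024 << 10, 4096, 12, 2); /* 1MiB records: back to ~0.6% */
    return (0);
}

The roundup itself adds at most nparity sectors per block, so the relative
waste shrinks as a block spans more sectors - hence the advice above.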

--matt


Re: RAID-Z wasted space - asize roundups to nparity +1

Adam Nowacki-3
On 2013-01-28 22:55, Matthew Ahrens wrote:
> This is so that we won't end up with small, unallocatable segments.
>   E.g. if you are using RAIDZ2, the smallest usable segment would be 3
> sectors (1 sector data + 2 sectors parity).  If we left a 1 or 2 sector
> free segment, it would be unusable and you'd be able to get into strange
> accounting situations where you have free space but can't write because
> you're "out of space".

Sounds reasonable.

> The amount of waste due to this can be minimized by using larger
> blocksizes (e.g. the default recordsize of 128k and files larger than
> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather
> than 4k sector disks).  In your case these techniques would limit the
> waste to 0.6%.

This brings up another issue - the recordsize cap of 128KiB. We are using
the pool for off-line storage of large files (from 50MB to 20GB). Files
are stored and read sequentially as a whole. With 12 disks in RAID-Z2,
4KiB sectors, a 128KiB record size and the padding above, 9.4% of disk
space goes completely unused - one whole disk.

Increasing the recordsize cap seems trivial enough. On-disk structures and
kernel code support it already - a single line of code had to be changed
(#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes.
This of course breaks compatibility with any other system without this
modification. With Sun's cooperation this could be handled in a safe and
compatible manner via a pool version upgrade. A recordsize of 128KiB would
remain the default but anyone could increase it with zfs set.
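
For reference, a sketch of that one-line change (SPA_MAXBLOCKSHIFT and the
derived SPA_MAXBLOCKSIZE live in the ZFS sys/spa.h header; 1 << 17 = 128KiB,
1 << 20 = 1MiB):

/* sys/spa.h - raising the block size cap from 128KiB to 1MiB: */
#define SPA_MAXBLOCKSHIFT   20  /* was 17 */
#define SPA_MAXBLOCKSIZE    (1ULL << SPA_MAXBLOCKSHIFT)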

The pool appears to work just fine, with 15TB copied so far from another
pool. Wasted disk space drops to 0.7%. Sequential read speed increased
from ~400MB/s to ~600MB/s. Writes stay about the same at ~300MB/s.

So far, however, I have not been able to boot from that pool. gptzfsboot
required a heap size increase and appears to work. zfsloader crashes and
I've become lost in the code.

I've also identified another problem with ZFS wasting disk space. When
compression is off, allocations are always a multiple of the record size.
With the default recordsize of 128KiB, a 129KiB file would use 256KiB of
disk space (+ parity and other inefficiencies mentioned above). This may be
there to help with fragmentation, but then it would be good to have a
setting to turn it off - even if by means of a no-op compression that would
count trailing zeroes and return a short psize.


Re: RAID-Z wasted space - asize roundups to nparity +1

Steven Hartland

----- Original Message -----
From: "Adam Nowacki" <[hidden email]>


> This brings up another issue - the recordsize cap of 128KiB. We are using
> the pool for off-line storage of large files (from 50MB to 20GB). Files
> are stored and read sequentially as a whole. With 12 disks in RAID-Z2,
> 4KiB sectors, a 128KiB record size and the padding above, 9.4% of disk
> space goes completely unused - one whole disk.

This is something that's being worked on upstream; unfortunately it's not
as trivial as it first looks.

    Regards
    Steve


Re: RAID-Z wasted space - asize roundups to nparity +1

Olivier Smedts
In reply to this post by Adam Nowacki-3
2013/1/29 Adam Nowacki <[hidden email]>:

> Increasing the recordsize cap seems trivial enough. On-disk structures and
> kernel code support it already - a single line of code had to be changed
> (#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes.
> This of course breaks compatibility with any other system without this
> modification. With Sun's cooperation this could be handled in a safe and
> compatible manner via a pool version upgrade. A recordsize of 128KiB would
> remain the default but anyone could increase it with zfs set.

One MB blocksize is already implemented by Oracle with zpool version 32.

--
Olivier Smedts

Re: RAID-Z wasted space - asize roundups to nparity +1

Steven Hartland
----- Original Message -----
From: "Olivier Smedts" <[hidden email]>


> One MB blocksize is already implemented by Oracle with zpool version 32.

Oracle is not the upstream; since they went closed source, illumos is our
new upstream.

If you want to follow the discussion, see the thread titled "128K max
blocksize in zfs" on [hidden email].

    Regards
    Steve


Re: RAID-Z wasted space - asize roundups to nparity +1

Freddie Cash-8
In reply to this post by Adam Nowacki-3
On Jan 29, 2013 2:52 AM, "Adam Nowacki" <[hidden email]> wrote:

> Increasing the recordsize cap seems trivial enough. On-disk structures and
> kernel code support it already - a single line of code had to be changed
> (#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes.
> This of course breaks compatibility with any other system without this
> modification. With Sun's cooperation this could be handled in a safe and
> compatible manner via a pool version upgrade. A recordsize of 128KiB would
> remain the default but anyone could increase it with zfs set.

There's work upstream (Illumos, I believe, maybe Delphix?) to add support
for recordsizes above 128 KB. It'll be added as a feature flag, so it will
only be compatible with open-source ZFS implementations.

Re: RAID-Z wasted space - asize roundups to nparity +1

Matthew Ahrens
In reply to this post by Adam Nowacki-3
On Tue, Jan 29, 2013 at 2:51 AM, Adam Nowacki <[hidden email]> wrote:

> I've also identified another problem with ZFS wasting disk space. When
> compression is off, allocations are always a multiple of the record size.
> With the default recordsize of 128KiB, a 129KiB file would use 256KiB of
> disk space (+ parity and other inefficiencies mentioned above). This may be
> there to help with fragmentation, but then it would be good to have a
> setting to turn it off - even if by means of a no-op compression that would
> count trailing zeroes and return a short psize.

The most straightforward way to do this would be, as you alluded to, to
always compress the last block of the file, even if no compression has been
selected.  For maximum speed, we could use the already-implemented zle
(zero-length encoding) algorithm.
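
A toy sketch of the effect on the last block (the real zle compressor
run-length-encodes zero runs anywhere in the buffer; this only models
trimming a zeroed tail, and the function name is made up):

#include <stddef.h>
#include <stdint.h>

/*
 * Return the shortened physical size for a logical block whose tail is
 * all zeroes; the zeroed tail can be reconstructed on read because the
 * logical size is known.  psize stays a whole number of sectors.
 */
static size_t
shortened_psize(const uint8_t *buf, size_t lsize, size_t sectsize)
{
    size_t end = lsize;

    while (end > 0 && buf[end - 1] == 0)
        end--;
    return (((end + sectsize - 1) / sectsize) * sectsize);
}

For the 129KiB file mentioned earlier, the second 128KiB block would then
shrink from 32 4KiB sectors down to one.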

--matt