test hash functions for fsid

test hash functions for fsid

Rick Macklem
Hi,

If you have a FreeBSD system with lots of local file systems (1000s), maybe you could
run the attached program on it and email me the stats.
I'm just trying to see how well these hash functions work on fsids.
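
For anyone reading along without the attachment, the idea is roughly the
following (a simplified single-hash sketch; the actual testhash.c tries several
hash functions, H1-H4, and reports the max/min bucket counts for each):

#include <sys/param.h>
#include <sys/ucred.h>
#include <sys/mount.h>
#include <err.h>
#include <stdio.h>

#define HASHSIZE        256

/* FNV-1-style 32-bit hash over an arbitrary buffer. */
static uint32_t
hashbuf(const void *buf, size_t len, uint32_t hval)
{
        const unsigned char *p = buf;

        while (len-- != 0) {
                hval *= 16777619U;
                hval ^= *p++;
        }
        return (hval);
}

int
main(void)
{
        struct statfs *mntbuf;
        int cnt[HASHSIZE] = { 0 };
        int i, max, min, num;

        /* Get the list of mounted file systems and hash each f_fsid. */
        num = getmntinfo(&mntbuf, MNT_NOWAIT);
        if (num <= 0)
                err(1, "getmntinfo");
        for (i = 0; i < num; i++)
                cnt[hashbuf(&mntbuf[i].f_fsid, sizeof(fsid_t), 0) % HASHSIZE]++;
        max = 0;
        min = num;
        for (i = 0; i < HASHSIZE; i++) {
                max = MAX(max, cnt[i]);
                min = MIN(min, cnt[i]);
        }
        printf("H1: max=%d min=%d\n", max, min);
        return (0);
}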

Thanks in advance, rick

[Attachment: testhash.c (1K)]

Re: test hash functions for fsid

Conrad Meyer-2
Hi Rick,

On Wed, May 8, 2019 at 7:39 AM Rick Macklem <[hidden email]> wrote:
> If you have a FreeBSD system with lots of local file systems (1000s), maybe you could
> run the attached program on it and email me the stats.
> I'm just trying to see how well these hash functions work on fsids.

Below I'll get to some comments on hash functions and hash tables, but
first: have you considered a non-hash data structure, such as PCTRIE?
PCTRIE_DEFINE() creates a data structure with similar properties to a
hash table for this use, but does not require a hash function step at
all because all fsids are already unique, fixed-size, 64-bit values.
It may be especially space-efficient for non-ZFS FSIDs, when the
64-bit key is composed with '(fsid.val[1] << 32 | fsid.val[0])'.
(Elaborated on more below.(A))
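
(For instance, a sketch of composing that key, with explicit widening so the
shift of val[1] happens in 64 bits:)

#include <sys/param.h>
#include <sys/mount.h>          /* fsid_t */

/* Sketch: build the 64-bit trie key from an fsid_t. */
static uint64_t
fsid_to_key(const fsid_t *fsid)
{
        return (((uint64_t)(uint32_t)fsid->val[1] << 32) |
            (uint32_t)fsid->val[0]);
}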

Usually it makes sense to size hash tables for the size of the data;
for direct insert tables you want to keep the load factor below 0.8 or
something like that.  So the hardcoded 256 doesn't make a ton of
sense, IMO.
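
(As a rough sketch of that sizing rule, assuming a power-of-two table and 0.8
as the cutoff; the constants are illustrative, not a hard rule:)

#include <stddef.h>

/* Sketch: choose a table size that keeps entries/buckets below ~0.8. */
static size_t
choose_table_size(size_t nentries)
{
        size_t nbuckets = 16;

        while (nentries * 10 > nbuckets * 8)    /* i.e. load factor > 0.8 */
                nbuckets *= 2;
        return (nbuckets);
}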

As far as hash functions, both (hash32_buf == djb2, FNV32) are really
weak, but importantly for hash tables, fast.
https://github.com/Cyan4973/smhasher#summary provides some broader
context, but keep in mind most of those algorithms are targeting much
longer keys than 8 bytes.  I would guess that h1/h3 will be noticeably
worse than h2/h4 without performing any better, but I don't have data
to back that up.

(If you were to test calculate_crc32c() or XXH32/XXH64, included in
sys/contrib/zstd, you might observe fewer collisions and better
distribution.  But that isn't necessarily the most important factor
for hash table performance in this use.)

(A) FSIDs themselves have poor initial distribution and *very* few
unique bits (i.e., for filesystems that use vfs_getnewfsid, the int32
fsid val[1] will be identical across all filesystems of the same type,
such as NFS mounts).  The remaining val[0] is unique due to an
incrementing global counter.  You could pretty easily extract the
vfs_getnewfsid() code out to userspace to simulate creating a bunch of
fsid_t's and run some tests on that list.  It isn't a real world
distribution but it would probably be pretty close, outside of ZFS.
ZFS FSIDs have 8 bits of shared vfs type, but the remaining 56 bits
appear to be generated by arc4rand(9), so ZFS FSIDs will have good
distribution even without hashing.
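
(If it helps, a very rough userspace model of that kind of counter-based
generation is sketched below.  It is not the real vfs_getnewfsid(), which also
packs a synthetic major number into val[0] and checks for collisions, but it
should produce a similar-looking population to test against:)

#include <sys/param.h>
#include <sys/mount.h>          /* fsid_t */

/*
 * Crude model of a counter-based fsid allocator, NOT the real
 * vfs_getnewfsid(): val[0] gets the fs type in its upper bits plus a
 * global counter, val[1] gets just the fs type.
 */
static void
fake_getnewfsid(fsid_t *fsid, int vfs_typenum)
{
        static uint16_t mntid_base;

        fsid->val[0] = ((uint32_t)vfs_typenum << 16) | mntid_base++;
        fsid->val[1] = vfs_typenum;
}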

Anyway, maybe this is helpful, maybe not.  Thanks for working on
scaling exports, and NFS in general!  It is appreciated.

Take care,
Conrad

Re: test hash functions for fsid

Bruce Evans-4
On Wed, 8 May 2019, Conrad Meyer wrote:

> On Wed, May 8, 2019 at 7:39 AM Rick Macklem <[hidden email]> wrote:
>> If you have a FreeBSD system with lots of local file systems (1000s), maybe you could
>> run the attached program on it and email me the stats.
>> I'm just trying to see how well these hash functions work on fsids.
>
> Below I'll get to some comments on hash functions and hash tables, but
> first: have you considered a non-hash data structure, such as PCTRIE?
> PCTRIE_DEFINE() creates a datastructure with similar properties to a
> hash table for this use, but does not require a hash function step at
> all because all fsids are already unique, fixed-size, 64-bit values.
> It may be especially space-efficient for non-ZFS FSIDs, when the
> 64-bit key is composed with '(fsid.val[1] << 32 | fsid.val[0])'.
> (Elaborated on more below.(A))

fsids must be unique in 32 bits, especially for nfs uses, since they
are usually blindly truncated to 32 bits in nfs clients that have
32-bit dev_t to form st_dev, but st_dev must be unique.

The FreeBSD-5 client blindly truncates:

  vap->va_fsid = vp->v_mount->mnt_stat.f_fsid.val[0];

The FreeBSD-11 client does the same for nfsv3, but for nfsv4 it does
hashing stuff to convert from a 128-bit (?) na_filesid to 32 bits.

va_fsid isn't really an fsid.  It is a dev_t in all versions of FreeBSD,
so it is too small to hold the 64-bit fsids created by vfs_getnewfsid()
in all versions of FreeBSD except recent versions where dev_t is also
64 bits -- it works accidentally for them.

> (A) FSIDs themselves have poor initial distribution and *very* few
> unique bits (i.e., for filesystems that use vfs_getnewfsid, the int32
> fsid val[1] will be identical across all filesystems of the same type,
> such as NFS mounts).  The remaining val[0] is unique due to an
> incrementing global counter.

val[1] is useless unless there are more than 256 fs types, since the fs type is
already encoded in the top 8 bits of the 24-bit minor number in fsid.val[0],
except on systems with 8-bit minor numbers and in buggy versions of
nfs that truncate the minor number to 8 bits (very old versions of oldnfs
and not so old versions of newnfs).

val[1] should never be used since it gets truncated away in many cases.

val[0] has 16 bits of uniqueness from the counter, 8 bits mostly wasted on
distinguishing the fs type, and another 8 bits wasted almost entirely on
distinguishing the device type (255 for all synthetic devices was
to stay away from major numbers 0-254 for physical devices, but devfs
made all major numbers for physical devices 0).

In some versions of FreeBSD, changing makedev() broke the uniqueness of
val[0] by shifting the major bits into oblivion (so synthetic devices
were not distinguishable from physical devices, and since physical devices
use a dense set of minor numbers, collisions were likely).

The difference between servers' and clients' use of fsids is confusing, and
I probably got some of the above wrong by conflating them.  Clients should
generate fsids unique in 32 bits.  Using vfs_getnewfsid() and ignoring
val[1] almost does this.  This is needed to support applications using
compat syscalls with 32-bit dev_t in st_dev and other places.  Other file
systems should also avoid actually using 64-bit dev_t for the same reason.
Using devfs almost does this.  It might be possible to generate more than
2**24 ptys so that the minor numbers start using the top 32 bits, but this
is foot-shooting and hopefully disk devices are already allocated with
low minor numbers.  Devices which can back file systems should be in a
different namespace (e.g., major == 1) to avoid this problem.

I think device numbers from servers are only visible to clients for
special files.  Uniqueness then is not so important, but I found that
changing makedev() broke it by noticing that server device numbers
were truncated on clients.  If the server fsid_t or dev_t is larger
than the client fsid_t or dev_t, then there is no way that the client
can represent all server numbers.  Hashing can work for sparse numbers,
but is complicated.  Servers should also not actually use 64-bit dev_t.

Bruce

Re: test hash functions for fsid

Bruce Evans-4
In reply to this post by Conrad Meyer-2
On Wed, 8 May 2019, Conrad Meyer wrote:

Another reply.

> (A) FSIDs themselves have poor initial distribution and *very* few
> unique bits (i.e., for filesystems that use vfs_getnewfsid, the int32
> fsid val[1] will be identical across all filesystems of the same type,
> such as NFS mounts).  The remaining val[0] is unique due to an
> incrementing global counter.  You could pretty easily extract the
> vfs_getnewfsid() code out to userspace to simulate creating a bunch of
> fsid_t's and run some tests on that list.

It is a bug to actually use val[1] or 64-bit dev_t.  Using it is either
unnecessary because val[0] is unique, or breaks compat syscalls.  See
my previous reply.  (I have a patch to make the compat syscalls fail
if they would truncate the dev_t, but it is too strict to be the default
and I forget if I committed it).

> It isn't a real world
> distribution but it would probably be pretty close, outside of ZFS.
> ZFS FSIDs have 8 bits of shared vfs type, but the remaining 56 bits
> appears to be generated by arc4rand(9), so ZFS FSIDs will have good
> distribution even without hashing.

Bug in zfs.

Even stat(1) doesn't understand these 64-bit numbers.  It misformats
them in decimal:

$ stat ~
16921688315829575803 2179998 drwxr-xr-x 62 bde bde 18446744073709551615 85 "Nov 25 21:21:07 2016" "May  8 16:33:55 2019" "May  8 16:33:55 2019" "Sep 28 00:06:41 2015" 5632 24 0x800 /home/bde

The first large number is st_dev and the second large number is st_rdev.
The decimal formatting of these is hard enough to read when they are
32 bits.  The 1844 number looks like it is near UINT64_MAX, and is
in fact exactly that, so it is far from unique.  It is apparently just
NODEV = (dev_t)(-1).  This is correct except for the formatting, while
in ffs st_rdev is garbage except for actual devices, since ufs_getattr()
sets va_rdev to di_rdev even for non-devices, but for non-devices
di_rdev is an alias for the first block number.

I did commit my checks for truncation of dev_t's (r335035 and r335053).
Fixing makedev() put the synthetic major number 255 in vfs_getnewfsid()
back in the low 32 bits, so the checks usually pass for nfs.  However,
for zfs, they would often fail for the large number in st_dev, and always
fail for regular files for the large number NODEV = UINT64_MAX in st_rdev.
They should fail for NODEV for regular files on all file systems.

Bruce

Re: test hash functions for fsid

Rick Macklem
In reply to this post by Rick Macklem
Peter wrote:

>Here’s the output from a couple of our servers (filur00 with around 24289 zfs
>filesystems, balur01 with about 72350 zfs filesystems, filur04 about 7500 zfs
>filesystems):
>
>
>Lpeter86@filur00:~ % ./testhash
>H1: max=24290 min=0
>H2: max=24290 min=0
>H3: max=24290 min=0
>H4: max=24290 min=0
>
>Lpeter86@balur01:~ % /tmp/testhash
>H1: max=73343 min=0
>H2: max=73343 min=0
>H3: max=73343 min=0
>H4: max=73343 min=0
>
>Lpeter86@filur04:~ % /tmp/testhash
>H1: max=7560 min=0
>H2: max=7560 min=0
>H3: max=7560 min=0
>H4: max=7560 min=0
>
I'll admit when I first looked at this, I was baffled;-)
Turns out the f_fsid field is just zeros when getmntinfo() is run by non-root.
Please do this again as "root". I have attached a slightly modified version
with an additional trivial hash and it calculates variance.
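
(Roughly speaking, the idea is to compute the variation over the per-bucket
counts, something like the following sketch; the attached program may differ
in the details:)

#include <math.h>
#include <stdio.h>

/*
 * Sketch: report the total and average variation of the per-bucket
 * counts about the mean, where cnt[] holds how many fsids landed in
 * each of the tblsize buckets.
 */
static void
bucket_stats(const int *cnt, int tblsize, int nfsids)
{
        double mean, totvari;
        int i;

        mean = (double)nfsids / tblsize;
        totvari = 0.0;
        for (i = 0; i < tblsize; i++)
                totvari += fabs((double)cnt[i] - mean);
        printf("mean=%.1f totvari=%f avevari=%f\n", mean, totvari,
            totvari / tblsize);
}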

Thanks for doing this, rick


[Attachment: testhash.c (3K)]

Re: test hash functions for fsid

Rick Macklem
In reply to this post by Conrad Meyer-2
Conrad Meyer wrote:

>On Wed, May 8, 2019 at 7:39 AM Rick Macklem <[hidden email]> wrote:
>> If you have a FreeBSD system with lots of local file systems (1000s), maybe you could
>> run the attached program on it and email me the stats.
>> I'm just trying to see how well these hash functions work on fsids.
>
>Below I'll get to some comments on hash functions and hash tables, but
>first: have you considered a non-hash data structure, such as PCTRIE?
>PCTRIE_DEFINE() creates a datastructure with similar properties to a
>hash table for this use, but does not require a hash function step at
>all because all fsids are already unique, fixed-size, 64-bit values.
>It may be especially space-efficient for non-ZFS FSIDs, when the
>64-bit key is composed with '(fsid.val[1] << 32 | fsid.val[0])'.
>(Elaborated on more below.(A))
I'll admit I had never heard of PCTRIE, but <sys/pctrie.h> seems to be
#ifdef _KERNEL, so it can't be used in userland?

>Usually it makes sense to size hash tables for the size of the data;
>for direct insert tables you want to keep the load factor below 0.8 or
>something like that.  So the hardcoded 256 doesn't make a ton of
>sense, IMO.
Yes. I just chose 256 for this test program. Unfortunately, the way mountd.c is
currently coded, the hash table must be sized before the getmntinfo() call,
so it must be sized before it knows how many file systems there are.
Since this is userland and table size isn't a memory issue, I'm just tempted to
make it large enough for a large server with something like 25,000 file systems.
(I'd guess somewhere in the 256->1024 range would work. I'm not sure what
 you mean by load factor below 0.8?)

>As far as hash functions, both (hash32_buf == djb2, FNV32) are really
>weak, but importantly for hash tables, fast.
>https://github.com/Cyan4973/smhasher#summary provides some broader
>context, but keep in mind most of those algorithms are targeting much
>longer keys than 8 bytes.  I would guess that h1/h3 will be noticably
>worse than h2/h4 without performing any better, but I don't have data
>to back that up.
>
>(If you were to test calculate_crc32c() or XXH32/XXH64, included in
>sys/contrib/zstd, you might observe fewer collisions and better
>distribution.  But that isn't necessarily the most important factor
>for hash table performance in this use.)
>
>(A) FSIDs themselves have poor initial distribution and *very* few
>unique bits (i.e., for filesystems that use vfs_getnewfsid, the int32
>fsid val[1] will be identical across all filesystems of the same type,
>such as NFS mounts).  The remaining val[0] is unique due to an
>incrementing global counter.  You could pretty easily extract the
>vfs_getnewfsid() code out to userspace to simulate creating a bunch of
>fsid_t's and run some tests on that list.
Fortunately, neither ZFS nor UFS uses vfs_getnewfsid() unless there is a collision
and I don't really care about other file system types.
(To clarify for bde@ and others, this is for file systems exported by mountd,
 which uses a single linked list for all the exported file systems, keyed on the
 f_fsid returned by statfs/getmntinfo. Since almost all exported file systems are
 UFS or ZFS and I suspect only ZFS will have 1000s of file systems in an NFS server
 box anytime soon, I don't really care how well others work. After all, it just does
 a linear search down the list of N file systems right now, and just about anything
 should be an improvement.)
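
(Roughly what I have in mind is a table of SLIST heads indexed by a hash of the
fsid, sketched below with illustrative names rather than the actual mountd.c
structures, and assuming fnv_32_buf() from <sys/hash.h> works in userland:)

#include <sys/param.h>
#include <sys/mount.h>
#include <sys/queue.h>
#include <sys/hash.h>           /* fnv_32_buf() */

struct expfs {
        SLIST_ENTRY(expfs)      ef_link;
        fsid_t                  ef_fsid;
        /* ... the rest of the per-file-system export state ... */
};
SLIST_HEAD(expfshead, expfs);

static struct expfshead *fstbl;         /* malloc()d array of list heads */
static u_int            fstblsize;      /* number of heads in fstbl */

/* Sketch: find the export entry for an fsid via a short hash chain. */
static struct expfs *
find_expfs(const fsid_t *fsid)
{
        struct expfs *ep;
        u_int idx;

        idx = fnv_32_buf(fsid, sizeof(*fsid), 0) % fstblsize;
        SLIST_FOREACH(ep, &fstbl[idx], ef_link)
                if (ep->ef_fsid.val[0] == fsid->val[0] &&
                    ep->ef_fsid.val[1] == fsid->val[1])
                        return (ep);
        return (NULL);
}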

>It isn't a real world
>distribution but it would probably be pretty close, outside of ZFS.
>ZFS FSIDs have 8 bits of shared vfs type, but the remaining 56 bits
>appears to be generated by arc4rand(9), so ZFS FSIDs will have good
>distribution even without hashing.
I added a simple (take the low order bits of val[0]) case to the test. I actually
suspect any of the hash functions will work well enough, since, as you note, most of
the values (val[0] and 24 bits of val[1]) are from a random # generator which should
be pretty uniform in distribution for ZFS.
UFS now uses a value from the superblock. It appears that newfs sets val[0] to the
creation time of the file system and val[1] to a random value. If it had been the
reverse, I would be tempted to only use val[0], but I'll see how well the test goes
for Peter. (I didn't know, but it has to run as "root". When run as non-root, the
f_fsid field is returned as 0 for all file systems by getmntinfo().)

Thanks, rick


Re: test hash functions for fsid

Rick Macklem
In reply to this post by Conrad Meyer-2
Conrad Meyer wrote:
[good stuff snipped]
>Usually it makes sense to size hash tables for the size of the data;
>for direct insert tables you want to keep the load factor below 0.8 or
>something like that.  So the hardcoded 256 doesn't make a ton of
>sense, IMO.
I took a closer look at the code and I think I can delay use of the hash tables until
after getmntinfo() returns the number of file systems, with a little tweaking.
Since Peter already has a server with over 72000 file systems, setting it dynamically
seems worth doing.

I didn't understand what you meant by load factor below 0.8, but how does

   num / 20  (where "num" is the number of file systems->entries to be hashed)

sound for the size of the tables? (There would be one table now and a second one
allocated for the patch that does updates to the kernel, for a total of two
of them malloc()d when the daemon starts up. The tables are just SLIST_HEAD()s,
or one pointer per table entry.)
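
(A sketch of the allocation, reusing the fstbl/fstblsize declarations from the
sketch in my earlier post and using L = 20; the real code will differ:)

#include <sys/param.h>
#include <sys/ucred.h>
#include <sys/mount.h>
#include <sys/queue.h>
#include <err.h>
#include <stdlib.h>

#define TARGET_LIST_LEN 20      /* "L", the target average list length */

/* Sketch: size and malloc() the table once getmntinfo() reports the
 * number of file systems, then initialize the SLIST heads. */
static void
alloc_fstbl(void)
{
        struct statfs *mntbuf;
        int i, num;

        num = getmntinfo(&mntbuf, MNT_NOWAIT);
        if (num <= 0)
                err(1, "getmntinfo");
        fstblsize = MAX(num / TARGET_LIST_LEN, 1);
        fstbl = malloc(fstblsize * sizeof(*fstbl));
        if (fstbl == NULL)
                err(1, "malloc");
        for (i = 0; i < (int)fstblsize; i++)
                SLIST_INIT(&fstbl[i]);
}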

Btw, I will post Peter's test results if he says that's ok, but the first three hashes
work about equally well. For his big 72532 file system server and H1 (which
seems to be the winner by a small fraction):
Mean: 283
Ave variation about the mean: 13 (or less than 5%)
[more good stuff snipped]

Thanks, rick



Re: test hash functions for fsid

Rick Macklem
In reply to this post by Rick Macklem
Peter Eriksson wrote:
>Ah, of course. Sorry about that :-)
Nothing to be sorry about. I didn't realize that f_fsid would be 0 for non-root
in the getmntinfo() reply.
The test uses a hash table size of 256 and doesn't print the mean or the average
variation about the mean, so I added those manually.

>Here’s some output (as root):
>
># Balur01 - 72352 zfs filesystems
>root@balur01:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=340 min=237 totvari=3384.937500
>H2: max=338 min=238 totvari=3760.296875
>H3: max=333 min=241 totvari=3373.703125
>H4: max=487 min=277 totvari=36364.765625
>H5: max=324 min=107 totvari=36425.234375
Mean: 283
Ave variation about mean for best one: 13.2 (about 4.6%)

># Balur03 - 42023 zfs filesystems
>root@balur03:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=202 min=135 totvari=2530.750000
>H2: max=203 min=128 totvari=2792.500000
>H3: max=196 min=127 totvari=2597.500000
>H4: max=288 min=166 totvari=20886.000000
>H5: max=165 min=55 totvari=20887.250000
Mean: 164
Ave variation about mean for best one: 9.8 (about 6%)

># Filur01 - 21182 zfs filesystems
>root@filur01:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=111 min=60 totvari=1902.765625
>H2: max=113 min=56 totvari=1962.921875
>H3: max=116 min=58 totvari=1870.000000
>H4: max=158 min=83 totvari=10640.921875
>H5: max=79 min=23 totvari=10640.000000
Mean: 83
Ave variation about mean for best one: 7.3 (about 9%)

># Filur02 - 20981 zfs filesystems
>root@filur02:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=115 min=60 totvari=1905.117188
>H2: max=106 min=62 totvari=1737.164062
>H3: max=105 min=57 totvari=1961.210938
>H4: max=157 min=77 totvari=10462.023438
>H5: max=84 min=26 totvari=10451.976562
Mean: 82
Ave variation about mean for best one: 6.8 (about 8%)

># Filur04 - 7500 zfs filesystems
>root@filur04:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=43 min=14 totvari=1030.890625
>H2: max=48 min=16 totvari=1145.218750
>H3: max=46 min=17 totvari=1049.328125
>H4: max=63 min=25 totvari=3804.109375
Mean: 29
Ave variation about mean for best one: 4.0 (about 13%)

># Filur05 - 1478 zfs filesystems
>root@filur05:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=14 min=1 totvari=505.429688
>H2: max=17 min=0 totvari=569.273438
>H3: max=14 min=1 totvari=509.234375
>H4: max=19 min=2 totvari=890.671875
>H5: max=10 min=0 totvari=806.742188
Mean: 6
Ave variation about mean for best one: 2.0 (about 33%)

># testy - 53 zfs filesystems
>root@testy:/usr/home/Lpeter86 # /tmp/testhash
>H1: max=2 min=0 totvari=91.531250
>H2: max=2 min=0 totvari=91.078125
>H3: max=3 min=0 totvari=92.437500
>H4: max=3 min=0 totvari=107.187500
>H5: max=3 min=0 totvari=79.046875
I won't bother with Mean, etc. since most lists would have been 0-2 entries.

The first 3 hashes appear to work about equally well, with H1 being the winner
by a small margin. (H1 is fnv_32_buf() with an initial value of 0.)

>Just for fun I’ve also attached a screenshot from our “management console” with a
>current status of our FreeBSD fileserver (11:30, so it is just before lunch here).
>Currently the number of NFS users is a bit on the low side (more NFS clients are
>being integrated all the time, but a lot is student-driven, depending on what
>classes/labs they are taking and whether they use the Linux labs or not).

Thanks for running the test Peter, rick
ps: If anyone else has a server with a lot of file systems (particularly a lot of UFS
       file systems), feel free to post test results.
       However, I think Peter's results have given me what I need, at least for ZFS.
pps: As a guy who coded NFS when disk drives stored Mbytes and were the size of
       washing machines, I still can't quite wrap my head around a server with
       over 72,000 file systems on it ;-)


Re: test hash functions for fsid

Conrad Meyer-2
In reply to this post by Rick Macklem
Hi Rick,

On Wed, May 8, 2019 at 5:41 PM Rick Macklem <[hidden email]> wrote:
[liberal snipping throughout :-)]
>
> I'll admit I had never heard of PCTRIE, but <sys/pctrie.h> seems to be
> #ifdef _KERNEL, so it can't be used in userland?

Ah, you're completely correct.  I had considered using PCTRIE for a
userspace application earlier, and had started converting it to be
buildable in userspace, but I never finished.  (I determined it
wouldn't be a good fit for that application, unfortunately.)  I do
think it might be a good data structure to expose to base userspace
programs eventually, but you probably don't really want to tackle that
extra work :-).  A hash table is totally fine.

> Yes. I just chose 256 for this test program. Unfortunately, the way mountd.c is
> currently coded, the hash table must be sized before the getmntinfo() call,
> so it must be sized before it knows how many file systems there are.
> Since this is userland and table size isn't a memory issue, I'm just tempted to
> make it large enough for a large server with something like 25,000 file systems.

No objection, so long as people can still run tiny NFS servers on
their Raspberry Pis. :-)

> (I'd guess somewhere in the 256->1024 range would work. I'm not sure what
>  you mean by load factor below 0.8?)

Load factor is just number of valid entries divided by space in the
table.  So if your table has 256 spaces, load factor 0.8 would be
having about 205 valid entries.

> Fortunately, neither ZFS nor UFS uses vfs_getnewfsid() unless there is a collision
> and I don't really care about other file system types.

Ah, sorry — I didn't read carefully enough when looking for fsid initialization.

> After all, it just does
>  a linear search down the list of N file systems right and just about anything
>  should be an improvement.)

Yes :-).

> I added a simple (take the low order bits of val[0]) case to the test. I actually
> suspect any of the hash functions will work well enough, since, as you note, most of the values (val[0] and 24bits of val[1]) are from a random # generator which should
> be pretty uniform in distribution for ZFS.
> UFS now uses a value from the superblock. It appears that newfs sets val[0] to the
> creation time of the file system and val[1] to a random value.

Hm, it looks like makefs sets val[1] (fs_id[1]) to a non-random value
generated by the predictable PRNG random(3), for reproducible build
reasons.  makefs seeds srandom(3) with either the current time in
seconds (low entropy) or some known timestamp (either from a file,
also in seconds, or an integer) (no entropy).  I guess random(3) may
provide better distribution for the table than a plain global counter,
though.

> If it had been the
> reverse, I would be tempted to only use val[0], but I'll see how well the test goes
> for Peter.

Seems like you were right — any old function is good enough :-).

Take care,
Conrad

Re: test hash functions for fsid

Rick Macklem
Conrad Meyer wrote:

>On Wed, May 8, 2019 at 5:41 PM Rick Macklem <[hidden email]> wrote:
>[liberal snipping throughout :-)]
>>
>> I'll admit I had never heard of PCTRIE, but <sys/pctrie.h> seems to be
>> #ifdef _KERNEL, so it can't be used in userland?
>
>Ah, you're completely correct.  I had considered using PCTRIE for a
>userspace application earlier, and had started converting it to be
>buildable in userspace, but I never finished.  (I determined it
>wouldn't be a good fit for that application, unfortunately.)  I do
>think it might be a good datastructure to expose to base userspace
>programs eventually, but you probably don't really want to tackle that
>extra work :-).  A hash table is totally fine.
>
>> Yes. I just chose 256 for this test program. Unfortunately, the way mountd.c is
>> currently coded, the hash table must be sized before the getmntinfo() call,
>> so it must be sized before it knows how many file systems there are.
>> Since this is userland and table size isn't a memory issue, I'm just tempted to
>> make it large enough for a large server with something like 25,000 file systems.
>
>No objection, so long as people can still run tiny NFS servers on
>their Raspberry Pis. :-)
I do now think I can dynamically size it based on how many file systems
(which is how many structures) will be linked off the table, so this shouldn't be
a problem.

>> (I'd guess somewhere in the 256->1024 range would work. I'm not sure what
>>  you mean by load factor below 0.8?)
>
>Load factor is just number of valid entries divided by space in the
>table.  So if your table has 256 spaces, load factor 0.8 would be
>having about 205 valid entries.
Hmm. That would suggest a large table size of about 100,000 for Peter's
72,000 file system case. It would result in a list length averaging about 1 entry,
but I think a linear search through 10-20 entries won't take long enough to be
a problem for this case. (This happens whenever mountd receives a SIGHUP and
each time a client does a mount.)

As noted in the last post, I was thinking along the lines of 10->20 as a target
linked list length. (Or "table_size = num / L", where L is the target average list length.
L = 10->20 would be roughly a load factor of 10->20.)

Does that sound reasonable? (Interested in hearing from anyone on this.)

>> Fortunately, neither ZFS nor UFS uses vfs_getnewfsid() unless there is a collision
>> and I don't really care about other file system types.
>
>Ah, sorry — I didn't read carefully enough when looking for fsid initialization.
>
>> After all, it just does
>>  a linear search down the list of N file systems right and just about anything
>>  should be an improvement.)
>
>Yes :-).
>
>> I added a simple (take the low order bits of val[0]) case to the test. I actually
>> suspect any of the hash functions will work well enough, since, as you note, most of the values (val[0] and 24bits of val[1]) are from a random # generator which should
>> be pretty uniform in distribution for ZFS.
>> UFS now uses a value from the superblock. It appears that newfs sets val[0] to the
>> creation time of the file system and val[1] to a random value.
>
>Hm, it looks like makefs sets val[1] (fs_id[1]) to a non-random value
>generated by the predictable PRNG random(3), for reproducible build
>reasons.  makefs seeds srandom(3) with either the current time in
>seconds (low entropy) or some known timestamp (either from a file,
>also in seconds, or an integer) (no entropy).  I guess random(3) may
>provide better distribution for the table than a plain global counter,
>though.
Yes. It would be the distribution of values and not their randomness that would
matter for this case. Hopefully someone with a large # of UFS file systems will
run the test program and post the results. To be honest, I doubt if anyone will
create a server with enough UFS file systems for it to be important to hash their
fsid well. (It works fine for the small number of UFS file systems I have.)

>> If it had been the
>> reverse, I would be tempted to only use val[0], but I'll see how well the test goes
>> for Peter.
>
>Seems like you were right — any old function is good enough :-).
From Peter's test, the first three did fine and performed almost the same. The last
two weren't as good, but were still tolerable.

Thanks for the comments, rick

Re: test hash functions for fsid

Conrad Meyer-2
Hi Rick,

On Thu, May 9, 2019 at 2:44 PM Rick Macklem <[hidden email]> wrote:
> I do now think I can dynamically size it based on how many file systems
> (which is how many structures) will be linked off the table, so this shouldn't be
> a problem.

Great!

> >Load factor is just number of valid entries divided by space in the
> >table.  So if your table has 256 spaces, load factor 0.8 would be
> >having about 205 valid entries.
>
> Hmm. That would suggest a large table size of about 100,000 for Peter's
> 72,000 file system case. It would result in a list length averaging about 1 entry,

Yeah, that's usually how hash tables are sized — aiming for a list
length of 1 (for chaining hash tables) or relatively short probe
sequence (for open addressing / embedded entry tables).  It's what
provides the various "O(1)" properties hash tables ostensibly have.

> but I think a linear search through 10-20 entries won't take long enough to be
> a problem for this case. (This happens whenever mountd receives a SIGHUP and each time a client does a mount.)

Yes, it might not be noticeable in this application either way.

> As noted in the last post, I was thinking along the lines of 10->20 as a target
> linked list length. (Or "table_size = num / L", where L is the target average list length.
> L = 10->20 would be roughly a load average of 10->20.)
>
> Does that sound reasonable? (Interested in hearing from anyone on this.)

Wikipedia discusses it a little bit:
https://en.wikipedia.org/wiki/Hash_table#Key_statistics

Facebook recently open sourced their general-purpose C++ hashtable
implementations and talk at length about various tradeoffs, including
load: https://code.fb.com/developer-tools/f14/?r=1  It might be
interesting to read.

Anyway, usually it's a number less than 1.  10-20x is unconventional.

> Yes. It would be the distribution of values and not their randomness that would
> matter for this case. Hopefully someone with a large # of UFS file systems will
> run the test program and post the results. To be honest, I doubt if anyone will
> create a server with enough UFS file systems for it to be important to hash their
> fsid well. It works fine for the small number of UFS file systems I have.)

Maybe pho@ will take that as a challenge.  You could create a lot of
UFS filesystems with mdconfig and enough RAM.

> Thanks for the comments, rick

Cheers,
Conrad