Micro-benchmark for various time syscalls...

Sean Chittenden-4
I wrote a small micro-benchmark utility[1] to test various time  
syscalls and the results were a bit surprising to me.  The results  
were from a UP machine and I believe that the difference between  
gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've been  
bigger on an SMP system and performance would've degraded further with  
each additional core.

clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for  
most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/-  
10ms of CLOCK_REALTIME's value[2]).  In fact, I'd assume that  
CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2) (a  
statement I can't back up, but believe is likely to be correct) and  
therefore there isn't much harm (if any) in seeing clock_gettime(2) +  
CLOCK_REALTIME_FAST receive more widespread use vs. gettimeofday(2).  
FYI.  -sc

PS  Is there a reason that time(3) can't be implemented in terms of  
clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to  
time_t's whole second resolution.

% ./bench_time 9079882 | sort -rnk1
Timing micro-benchmark.  9079882 syscall iterations.
Avg. us/call    Elapsed     Name
9.322484    84.647053       gettimeofday(2)
8.955324    81.313291       time(3)
8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)
Last value from gettimeofday(2): 1212380080.620649
Last value from time(3): 1212380161
Last value from clock_gettime(2/CLOCK_VIRTUAL): 2.296430000
Last value from clock_gettime(2/CLOCK_SECOND): 1212380338.000000000
Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212380243.461081040
Last value from clock_gettime(2/CLOCK_REALTIME): 1212380240.459788612
Last value from clock_gettime(2/CLOCK_PROF): 185.560343000
Last value from clock_gettime(2/CLOCK_MONOTONIC_FAST): 5747219.271879584
Last value from clock_gettime(2/CLOCK_MONOTONIC): 5747216.886509281
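
The "Avg. us/call" column is elapsed time divided by the number of iterations. The following is a minimal sketch of that kind of timing loop, not the code in bench_time.c; the function names are made up for illustration, and the loop is timed with CLOCK_MONOTONIC as an assumption.

```c
/* FreeBSD hides CLOCK_*_FAST when _POSIX_C_SOURCE is set, so only define
 * it elsewhere (it exposes clock_gettime() on strict-C builds). */
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

/* Seconds between two timespec readings, as a double. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Average cost, in microseconds, of one clock_gettime(id) call over
 * `iters` calls.  Returns -1.0 if the clock ID is not supported here. */
double avg_us_per_call(clockid_t id, long iters)
{
    struct timespec start, end, scratch;

    assert(iters > 0);
    if (clock_gettime(id, &scratch) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        clock_gettime(id, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_seconds(start, end) * 1e6 / (double)iters;
}
```

Calling avg_us_per_call(CLOCK_REALTIME, 9079882) would correspond to one row of the table above; on FreeBSD the same call with CLOCK_REALTIME_FAST or CLOCK_SECOND would give the cheaper rows.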


[1] http://sean.chittenden.org/pubfiles/freebsd/bench_time.c

[2] sys/time.h comments about precision.  http://fxr.watson.org/fxr/source/sys/time.h#L269

--
Sean Chittenden
[hidden email]
http://sean.chittenden.org/

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[hidden email]"

Re: Micro-benchmark for various time syscalls...

Gary Stanley-2
At 12:54 AM 6/2/2008, Sean Chittenden wrote:

>I wrote a small micro-benchmark utility[1] to test various time
>syscalls and the results were a bit surprising to me.  The results
>were from a UP machine and I believe that the difference between
>gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've been
>bigger on an SMP system and performance would've degraded further with
>each additional core.
>
>clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for
>most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/-
>10ms of CLOCK_REALTIME's value[2]).  In fact, I'd assume that
>CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2) (a
>statement I can't back up, but believe is likely to be correct) and
>therefore there isn't much harm (if any) in seeing clock_gettime(2) +
>CLOCK_REALTIME_FAST receive more widespread use vs. gettimeofday(2).
>FYI.  -sc
>
>PS  Is there a reason that time(3) can't be implemented in terms of
>clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to
>time_t's whole second resolution.

Another interesting idea is to map gettimeofday() to userland, sort
of like darwin (commpage) and linux (vsyscall) via read only page.

Can you try changing microtime() in kern_time.c:gettimeofday() to
getmicrotime() to see if your benchmarks change any?

Also; what clock are you using for your benchmarks? ACPI? TSC?




Re: Micro-benchmark for various time syscalls...

kometen
In reply to this post by Sean Chittenden-4
> I wrote a small micro-benchmark utility[1] to test various time syscalls and
> the results were a bit surprising to me.  The results were from a UP machine
> and I believe that the difference between gettimeofday(2) and
> clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and
> performance would've degraded further with each additional core.
>
> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most
> authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of
> CLOCK_REALTIME's value[2]).  In fact, I'd assume that CLOCK_REALTIME_FAST is
> just as accurate as Linux's gettimeofday(2) (a statement I can't back up,
> but believe is likely to be correct) and therefore there isn't much harm (if
> any) in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more
> widespread use vs. gettimeofday(2).  FYI.  -sc
>
> PS  Is there a reason that time(3) can't be implemented in terms of
> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to time_t's
> whole second resolution.
>
> % ./bench_time 9079882 | sort -rnk1
> Timing micro-benchmark.  9079882 syscall iterations.
> Avg. us/call    Elapsed     Name
> 9.322484    84.647053       gettimeofday(2)
> 8.955324    81.313291       time(3)
> 8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
> 8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
> 0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
> 0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
> 0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
> 0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
> 0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)
> Last value from gettimeofday(2): 1212380080.620649
> Last value from time(3): 1212380161
> Last value from clock_gettime(2/CLOCK_VIRTUAL): 2.296430000
> Last value from clock_gettime(2/CLOCK_SECOND): 1212380338.000000000
> Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212380243.461081040
> Last value from clock_gettime(2/CLOCK_REALTIME): 1212380240.459788612
> Last value from clock_gettime(2/CLOCK_PROF): 185.560343000
> Last value from clock_gettime(2/CLOCK_MONOTONIC_FAST): 5747219.271879584
> Last value from clock_gettime(2/CLOCK_MONOTONIC): 5747216.886509281

rozetta~/devel/c%>sysctl hw.model
hw.model: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz

rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1
Timing micro-benchmark.  9079882 syscall iterations.
Avg. us/call    Elapsed         Name
1.405469        12.761494       clock_gettime(2/CLOCK_REALTIME)
1.313101        11.922799       time(3)
1.305518        11.853953       clock_gettime(2/CLOCK_MONOTONIC)
1.303947        11.839681       gettimeofday(2)
0.442908        4.021557        clock_gettime(2/CLOCK_PROF)
0.436484        3.963223        clock_gettime(2/CLOCK_VIRTUAL)
0.217718        1.976851        clock_gettime(2/CLOCK_MONOTONIC_FAST)
0.215264        1.954571        clock_gettime(2/CLOCK_REALTIME_FAST)
0.211779        1.922932        clock_gettime(2/CLOCK_SECOND)
Value from time(3): 1212391638
Last value from gettimeofday(2): 1212391626.146308
Last value from clock_gettime(2/CLOCK_VIRTUAL): 4.179847000
Last value from clock_gettime(2/CLOCK_SECOND): 1212391676.000000000
Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212391652.785214038
Last value from clock_gettime(2/CLOCK_REALTIME): 1212391650.830730996
Last value from clock_gettime(2/CLOCK_PROF): 60.276182000
Last value from clock_gettime(2/CLOCK_MONOTONIC_FAST): 1190915.000747909
Last value from clock_gettime(2/CLOCK_MONOTONIC): 1190913.024357334

gettimeofday(2) is about 6 times slower than CLOCK_REALTIME_FAST on this
system, versus about 28 times slower on your system.
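
The two ratios appear to come from dividing the gettimeofday(2) row by the CLOCK_REALTIME_FAST row in each table. A small sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* How many times slower one call is than another, given avg us/call. */
double slowdown(double slow_us, double fast_us)
{
    assert(fast_us > 0.0);
    return slow_us / fast_us;
}

/* Print the ratio for both machines, using the figures from the tables. */
void print_slowdowns(void)
{
    printf("Xeon E5345:  %.1fx\n", slowdown(1.303947, 0.215264));
    printf("first table: %.1fx\n", slowdown(9.322484, 0.330556));
}
```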

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare

Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by Sean Chittenden-4
On Sun, 1 Jun 2008, Sean Chittenden wrote:

> I wrote a small micro-benchmark utility[1] to test various time syscalls and
> the results were a bit surprising to me.  The results were from a UP machine
> and I believe that the difference between gettimeofday(2) and
> clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and
> performance would've degraded further with each additional core.

I wouldn't expect SMP to make much difference between CLOCK_REALTIME and
CLOCK_REALTIME_FAST.  The only difference is that the former calls
nanotime() where the latter calls getnanotime().  nanotime() always does
more, but it doesn't have any extra SMP overheads in most cases (in rare
cases like i386 using the i8254 timecounter, it needs to lock accesses to
the timecounter hardware).  gettimeofday() always does more than
CLOCK_REALTIME, but again no more for SMP.
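
A toy model of the distinction described above, as a sketch only: the struct and function names are hypothetical and much simpler than the kernel's real timecounter code. The "fast" path returns a value cached at the last periodic update, while the "precise" path also reads a hardware counter and adds the ticks elapsed since that update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot taken at each periodic update. */
struct snapshot {
    uint64_t base_ns;     /* time at the snapshot, in nanoseconds */
    uint32_t base_count;  /* hardware counter value at the snapshot */
    uint64_t freq_hz;     /* hardware counter frequency */
};

/* "Fast" reading: just the cached value, no hardware access. */
uint64_t model_getnanotime(const struct snapshot *s)
{
    assert(s != NULL);
    return s->base_ns;
}

/* "Precise" reading: cached value plus the counter ticks since the
 * snapshot.  counter_now stands in for the (possibly slow) hardware read. */
uint64_t model_nanotime(const struct snapshot *s, uint32_t counter_now)
{
    assert(s != NULL && s->freq_hz != 0);
    uint32_t delta = counter_now - s->base_count;  /* unsigned wrap is fine */
    return s->base_ns + (uint64_t)delta * 1000000000ULL / s->freq_hz;
}
```

In this model the extra cost of the precise path is the hardware read plus the scaling arithmetic, which is per-call work rather than anything that grows with the number of CPUs.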

> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most
> authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of
> CLOCK_REALTIME's value[2]).  In fact, I'd assume that CLOCK_REALTIME_FAST is
> just as accurate as Linux's gettimeofday(2) (a statement I can't back up, but
> believe is likely to be correct) and therefore there isn't much harm (if any)
> in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use
> vs. gettimeofday(2).  FYI.  -sc

The existence of most of CLOCK_* is a bug.  I wouldn't use CLOCK_REALTIME_FAST
for anything (if only because it doesn't exist in most kernels that I
run.  I switched from using gettimeofday() to CLOCK_REALTIME many years
ago when syscalls started taking less than 1 usec and still occasionally
have problems from this running old kernels, because old i386 kernels
don't support CLOCK_REALTIME and old amd64 kernels have a broken
CLOCK_REALTIME in 32-bit mode).

> PS  Is there a reason that time(3) can't be implemented in terms of
> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to time_t's
> whole second resolution.

I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low
accuracy timers provided by the get*time() family are accurate enough
to give the time in seconds.  Unfortunately, they are still broken --
they are all incoherent relative to nanotime() and some are incoherent
relative to each other.  CLOCK_SECOND can lag the time in seconds given
by up to tc_tick/HZ seconds.  This is because CLOCK_SECOND returns the
time in seconds at the last tc_windup(), so it misses seeing rollovers
of the second in the interval between the rollover and the next
tc_windup(), while nanotime() doesn't miss seeing these rollovers so
it gives incoherent times, with nanotime()/CLOCK_REALTIME being correct
and time_second/CLOCK_SECOND broken.  vfs_timestamp() already defaults
to using time_second, so it gives times incoherent with time() since
the latter still uses gettimeofday().  Some file system test programs
see this incoherency and I run them with vfs.timestamp.precision=3
(nanotime()) to avoid it.  File systems were micro-optimized to use
time_second (now not so micro optimized to use vfs_timestamp() which
defaults to using time_second), but micro-pessimizing them to use
nanotime() makes no significant difference.  This is because most file
system timestamp updates are cached (delayed until the next sync or
disk write), and in cases where the updates are written to disk the
time to read the clock is in the noise relative to the time for the
disk write.
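
The lag described above can be observed from userland by reading the cheap clock and then the precise one and comparing whole seconds. This is a sketch under assumptions: the helper names are made up, and on systems without CLOCK_SECOND it falls back to Linux's CLOCK_REALTIME_COARSE, which is only a rough stand-in.

```c
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

#if defined(CLOCK_SECOND)
#define CHEAP_CLOCK CLOCK_SECOND            /* FreeBSD */
#elif defined(CLOCK_REALTIME_COARSE)
#define CHEAP_CLOCK CLOCK_REALTIME_COARSE   /* Linux stand-in (assumption) */
#else
#define CHEAP_CLOCK CLOCK_REALTIME          /* no cheap clock available */
#endif

/* Read the cheap clock, then the precise one; return the difference in
 * whole seconds.  A result of 1 means the cheap clock missed a rollover. */
long seconds_lag(void)
{
    struct timespec cheap, precise;

    clock_gettime(CHEAP_CLOCK, &cheap);
    clock_gettime(CLOCK_REALTIME, &precise);
    return (long)(precise.tv_sec - cheap.tv_sec);
}

/* Count how many of `samples` paired readings show the cheap clock behind. */
long count_lagging(long samples)
{
    long lagging = 0;

    assert(samples >= 0);
    for (long i = 0; i < samples; i++)
        if (seconds_lag() > 0)
            lagging++;
    return lagging;
}
```

A nonzero count from count_lagging() over a run spanning several seconds would be the incoherency that the file system test programs notice.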

>
> % ./bench_time 9079882 | sort -rnk1
> Timing micro-benchmark.  9079882 syscall iterations.
> Avg. us/call    Elapsed     Name
> 9.322484    84.647053       gettimeofday(2)
> 8.955324    81.313291       time(3)
> 8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
> 8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
> 0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
> 0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
> 0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
> 0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
> 0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)

These are very slow.  Are they on a 486? :-)  I get about 262 ns for
CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
The syscall overhead is about 200 nsec (170 nsec for a simpler syscall
and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading
the TSC timecounter adds another 60 nsec, including a whole 6 nsec for
the hardware part of the read (perhaps more like 30 nsec than 60 for the
whole read).  The TSC doesn't work on all machines (never for SMP), but
this will hopefully change.  (Phenom is supposed to have TSCs that are
coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+
to implement this :-(.  Core2 already has a 40+ cycles rdtsc, but AFAIK
it doesn't have coherent TSCs.)  Other timecounters are much slower than
the TSC, but I haven't seen one take 8000 nsec since 486 days.

Some of my benchmark results:

2.205GHz A64 in 32-bit mode, VIA motherboard:
%%%
2008/01/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 240, max 77658, mean 242.171787, std 65.655259

2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317

2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400

2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440

2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301

2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
%%%

These times are for CLOCK_REALTIME.
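
The program that produced these min/max/mean/std lines is not shown in the thread. One plausible way to collect such figures, offered purely as a sketch with hypothetical names, is to read the clock back to back and keep running statistics on the differences between consecutive readings. The sketch reports variance; the standard deviation is its square root.

```c
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

struct call_stats {
    long n;
    double min_ns, max_ns, mean_ns, var_ns2;
};

/* Nanoseconds from a to b; subtract before converting to keep precision. */
static double diff_ns(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1e9 +
           (double)(b.tv_nsec - a.tv_nsec);
}

/* Call clock_gettime(id) n+1 times and summarize the n gaps between
 * consecutive readings (Welford's running mean and variance). */
struct call_stats measure_calls(clockid_t id, long n)
{
    struct call_stats s = { 0, 0.0, 0.0, 0.0, 0.0 };
    struct timespec prev, cur;
    double m2 = 0.0;

    assert(n > 0);
    clock_gettime(id, &prev);
    for (long i = 1; i <= n; i++) {
        clock_gettime(id, &cur);
        double x = diff_ns(prev, cur);
        double d = x - s.mean_ns;
        s.mean_ns += d / (double)i;
        m2 += d * (x - s.mean_ns);
        if (i == 1 || x < s.min_ns) s.min_ns = x;
        if (i == 1 || x > s.max_ns) s.max_ns = x;
        prev = cur;
    }
    s.n = n;
    s.var_ns2 = n > 1 ? m2 / (double)(n - 1) : 0.0;
    return s;
}
```

This only makes sense for clocks that advance on every read; for the cached "fast" clocks most gaps would be zero.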

This system has fairly fast ACPI and i8254 timecounters.  1500-800
nsec is more typical for ACPI-fast, and 4000-5000 is more typical
for i8254.  ACPI-fast should be named ACPI-not-very-slow.  ACPI-safe
is very slow, perhaps slower than i8254.  i8254 could be made about
twice as fast if anyone cared.

133MHz P1:
%%%
1996/07/12:
min 3, max 472, mean 3.320346, std 0.694846

1998/02/21 pre-phk:
min 3, max 595, mean 3.443382, std 0.767383

1998/02/21 post-phk:
min 4, max 99, mean 4.614527, std 0.710407

1999/12/04:
min 4, max 120, mean 4.630231, std 0.777733

2000/09/29:
min 5, max 203, mean 5.376130, std 1.912127

2001/05/19:
min 6, max 1715, mean 6.783378, std 2.015211

2001/09/02:
min 5, max 482, mean 5.474384, std 2.683939
%%%

These times are for gettimeofday().  Note that these are now in usec.
The timecounter is always the TSC (post-phk) or uses the TSC more
directly (pre-phk).  These times serve mainly to document time bloat
due to timecounters and SMPng.  The P1 has limited caching and suffers
more from longer code paths than new CPUs.

66MHz 486DX2:
%%%
1995/11/03:
min 13, max 171, mean 14.286634, std 1.836667

2000/11/15:
min 20, max 542, mean 21.843003, std 8.003137
%%%

Here the timecounter is always the i8254.  These times serve mainly
as a reminder of how slow old machines were.  The i8254 timecounter
hardware didn't take any longer back then (it was probably faster,
since old machines didn't have PCI bridges, and they had tunable ISA
wait states which I tuned), but a simple syscall took 7.2 usec and
gettimeofday() took much longer.  The bloat between 1995 and 2000 was
relatively similar to that on the P1 system.

Other implementation bugs (all in clock_getres()):
- all of the clock ids that use getnanotime() claim a resolution of 1
   nsec, but that is bogus.  The actual resolution is more like tc_tick/HZ.
   The extra resolution in a struct timespec is only used to return
   garbage related to the incoherency of the clocks.  (If it could be
   arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
   the clocks would be coherent and the times would always be a multiple
   of tc_tick/HZ, with no garbage in low bits.)
- CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
   bogus.  The actual resolution is more like 1/stathz, or perhaps 1
   microsecond.  hz is irrelevant here since statclock ticks are used.
   statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
   correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
   calcru() calculates the time to a resolution of 1 usec; it is just
   very inaccurate at that resolution.
"Resolution" is a poor term for the functionality needed here.  I think
a hint about the accuracy is more important.  In simple implementations
using interrupts and ticks, the accuracy would be about the same as
the resolution, but FreeBSD is more complicated.
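
The claimed resolutions discussed above are what clock_getres(2) reports, and they can be printed from userland for comparison with observed behaviour. A minimal sketch with hypothetical helper names; the FreeBSD-only clock IDs are compiled in only where the headers define them.

```c
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Resolution claimed by clock_getres() in nanoseconds, or -1 on error. */
long long claimed_resolution_ns(clockid_t id)
{
    struct timespec res;

    if (clock_getres(id, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

void print_claimed_resolution(const char *name, clockid_t id)
{
    assert(name != NULL);
    printf("%-24s %lld ns\n", name, claimed_resolution_ns(id));
}

void print_all_claimed_resolutions(void)
{
    print_claimed_resolution("CLOCK_REALTIME", CLOCK_REALTIME);
    print_claimed_resolution("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
#ifdef CLOCK_REALTIME_FAST
    print_claimed_resolution("CLOCK_REALTIME_FAST", CLOCK_REALTIME_FAST);
#endif
#ifdef CLOCK_VIRTUAL
    print_claimed_resolution("CLOCK_VIRTUAL", CLOCK_VIRTUAL);
#endif
#ifdef CLOCK_PROF
    print_claimed_resolution("CLOCK_PROF", CLOCK_PROF);
#endif
}
```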

Bruce

Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by Gary Stanley-2
On Mon, 2 Jun 2008, Gary Stanley wrote:

> At 12:54 AM 6/2/2008, Sean Chittenden wrote:
>> PS  Is there a reason that time(3) can't be implemented in terms of
>> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to
>> time_t's whole second resolution.
>
> Another interesting idea is to map gettimeofday() to userland, sort of like
> darwin (commpage) and linux (vsyscall) via read only page.

time() can reasonably be implemented like that, but not gettimeofday().
gettimeofday() should have an accuracy of 1 usec and it returns a large
data structure that cannot be locked by simple atomic accesses.  The
read-only page would have to be updated millions of times per second
or take a pagefault to access to give the same functionality as FreeBSD
gettimeofday().  The updates would cost about 100% of 1 CPU.  Other
CPUs could then read the time using locking like that in binuptime()
but more complicated (needs an atomic update for at least the generation
count, and probably more).  The pagefaults would give a smaller
pessimization (I guess slightly longer to reach microtime() than via
the current syscall, and identical time in microtime() to do the update
on demand).
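
The generation-count locking mentioned above can be sketched with C11 atomics. This is an illustrative seqlock-style reader and writer with hypothetical names, not the layout used by binuptime(), Darwin's commpage, or Linux's vsyscall page. It also does nothing about the update-rate problem raised above; it only shows how a reader can get a consistent multi-word value without a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared page contents.  gen == 0 means "update in progress". */
struct time_page {
    atomic_uint gen;
    _Atomic int64_t sec;
    _Atomic int64_t nsec;
};

/* Writer: invalidate, store the new time, then publish a new generation. */
void time_page_publish(struct time_page *p, int64_t sec, int64_t nsec)
{
    assert(p != NULL);
    unsigned next = atomic_load_explicit(&p->gen, memory_order_relaxed) + 1;
    if (next == 0)
        next = 1;                       /* never publish generation 0 */

    atomic_store_explicit(&p->gen, 0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&p->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&p->gen, next, memory_order_release);
}

/* Reader: retry until the generation is nonzero and unchanged across the
 * read.  Spins if nothing has ever been published. */
void time_page_read(struct time_page *p, int64_t *sec, int64_t *nsec)
{
    unsigned before, after;

    assert(p != NULL && sec != NULL && nsec != NULL);
    do {
        before = atomic_load_explicit(&p->gen, memory_order_acquire);
        *sec = atomic_load_explicit(&p->sec, memory_order_relaxed);
        *nsec = atomic_load_explicit(&p->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->gen, memory_order_relaxed);
    } while (before == 0 || before != after);
}
```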

Bruce

Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by kometen
On Mon, 2 Jun 2008, Claus Guttesen wrote:

>> % ./bench_time 9079882 | sort -rnk1
>> Timing micro-benchmark.  9079882 syscall iterations.
>> Avg. us/call    Elapsed     Name
>> 9.322484    84.647053       gettimeofday(2)
>> 8.955324    81.313291       time(3)
>> 8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
>> 8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
>> 0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
>> 0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
>> 0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
>> 0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
>> 0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)

In previous mail, I said that these were very slow.

> rozetta~/devel/c%>sysctl hw.model
> hw.model: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz
>
> rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1
> Timing micro-benchmark.  9079882 syscall iterations.
> Avg. us/call    Elapsed         Name
> 1.405469        12.761494       clock_gettime(2/CLOCK_REALTIME)
> 1.313101        11.922799       time(3)
> 1.305518        11.853953       clock_gettime(2/CLOCK_MONOTONIC)
> 1.303947        11.839681       gettimeofday(2)
> 0.442908        4.021557        clock_gettime(2/CLOCK_PROF)
> 0.436484        3.963223        clock_gettime(2/CLOCK_VIRTUAL)
> 0.217718        1.976851        clock_gettime(2/CLOCK_MONOTONIC_FAST)
> 0.215264        1.954571        clock_gettime(2/CLOCK_REALTIME_FAST)
> 0.211779        1.922932        clock_gettime(2/CLOCK_SECOND)

These seem about right for a normal untuned ~2GHz system:
- there is a syscall overhead of about 200 nsec
- the hardware parts of the ACPI (?) timecounter are very slow, so they
   add 1100 nsec
- anomalous extra 100 nsec for CLOCK_REALTIME.  CLOCK_REALTIME does less than
   gettimeofday().
- CLOCK_PROF and CLOCK_VIRTUAL use the slow function calcru() in the kernel.
   This apparently takes about the same time as a syscall.  calcru() uses
   cpu_ticks() (which normally uses the TSC on i386 and amd64) to determine
   the time spent since the thread was last context switched, so it is more
   accurate than CLOCK_REALTIME_FAST but less accurate than CLOCK_REALTIME;
   using the TSC makes it faster than a non-TSC timecounter.  calcru() still
   seems to have broken accounting for the current timeslice in other running
   threads in the process.

> gettimeofday is 6 times slower on this system, 28 times slower on your system.

1.epsilon times slower on my system :-).

Bruce

Re: Micro-benchmark for various time syscalls...

Sean Chittenden-4
In reply to this post by Bruce Evans-4
>> I wrote a small micro-benchmark utility[1] to test various time  
>> syscalls and the results were a bit surprising to me.  The results  
>> were from a UP machine and I believe that the difference between  
>> gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've  
>> been bigger on an SMP system and performance would've degraded  
>> further with each additional core.
>
> I wouldn't expect SMP to make much difference between CLOCK_REALTIME and
> CLOCK_REALTIME_FAST.  The only difference is that the former calls
> nanotime() where the latter calls getnanotime().  nanotime() always does
> more, but it doesn't have any extra SMP overheads in most cases (in rare
> cases like i386 using the i8254 timecounter, it needs to lock accesses to
> the timecounter hardware).  gettimeofday() always does more than
> CLOCK_REALTIME, but again no more for SMP.

You may be right; I can only speculate.  Going off of phk@'s
rhetorical questions regarding gettimeofday(2) working across
cores/threads, I assumed there would be some kind of synchronization.

http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-October/057280.html

>> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for  
>> most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/-  
>> 10ms of CLOCK_REALTIME's value[2]).  In fact, I'd assume that  
>> CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2)  
>> (a statement I can't back up, but believe is likely to be correct)  
>> and therefore there isn't much harm (if any) in seeing  
>> clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use  
>> vs. gettimeofday(2).  FYI.  -sc
>
> The existence of most of CLOCK_* is a bug.  I wouldn't use CLOCK_REALTIME_FAST
> for anything (if only because it doesn't exist in most kernels that I
> run.

I think that's debatable, actually.  I modified my little
micro-benchmark program to test the realtime values returned from each
execution and found that CLOCK_REALTIME_FAST likely updates itself
sufficiently frequently for most applications (not all, but most).  My
test ensures that time doesn't go backwards and tallies the number of
times that the values are identical.  It'd be nice if
CLOCK_REALTIME_FAST incremented by a small and reasonable fudge factor
every time it's invoked, so that the values aren't identical.
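
The fudge-factor idea can be approximated in userland today. The sketch below is an illustration only (the function name is made up, and this is not something CLOCK_REALTIME_FAST does): it bumps a reading by 1 ns whenever it is not later than the previous value handed out. It is not thread-safe as written.

```c
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Return `now`, or *last + 1 ns if `now` is not later than *last.
 * *last is updated to the value returned. */
struct timespec strictly_increasing(struct timespec *last, struct timespec now)
{
    assert(last != NULL);
    int not_later = now.tv_sec < last->tv_sec ||
        (now.tv_sec == last->tv_sec && now.tv_nsec <= last->tv_nsec);

    if (not_later) {
        now = *last;
        if (++now.tv_nsec >= 1000000000L) {   /* carry into seconds */
            now.tv_nsec = 0;
            now.tv_sec++;
        }
    }
    *last = now;
    return now;
}
```

A caller would keep one `last` value per stream of timestamps and pass each cheap clock reading through this function before using it.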

On my machine, I can make 100K gettimeofday(2) calls compared to 3M  
CLOCK_REALTIME_FAST calls, which is a significantly large delta when  
you're aiming for software that's handling around ~40-50Kpps and want  
to include time information periodically (see above comment about a  
fudge factor being included after every call *grin* ).

http://sean.chittenden.org/pubfiles/freebsd/bench_clock_realtime.c

% ./bench_clock_realtime 9079882 | sort -rnk1
clock realtime micro-benchmark.  9079882 syscall iterations.
Avg. us/call Elapsed Name
9.317078 84.597968 gettimeofday(2)
8.960372 81.359120 time(3)
8.776467 79.689287 clock_gettime(2/CLOCK_REALTIME)
0.332357 3.017763 clock_gettime(2/CLOCK_REALTIME_FAST)
0.311705 2.830246 clock_gettime(2/CLOCK_SECOND)
Value from time(3): 1212427374
Last value from gettimeofday(2): 1212427293.590511 Equal: 0
Last value from clock_gettime(2/CLOCK_SECOND): 1212427460.000000000 Equal: 9079878
Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212427457.656410126 Equal: 9078198
Last value from clock_gettime(2/CLOCK_REALTIME): 1212427454.639076390 Equal: 0

% irb
 >> tot = 9079882
=> 9079882
 >> eq = 9078198
=> 9078198
 >> tot - eq
=> 1684
 >> time = 3.017763
=> 3.017763
 >> (tot - eq) / time
=> 558.029242190324
 >> tot / time
=> 3008812.15655437  # number of CLOCK_REALTIME_FAST calls per second
 >> tot / 84.597968
=> 107329.788346689  # number of gettimeofday(2) calls per second
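
The "Equal:" tallies and the irb arithmetic above can be reproduced with a loop along these lines. This is a sketch with hypothetical names, not the code in bench_clock_realtime.c: it counts consecutive identical readings, counts any backwards steps, and converts the tally into distinct updates per second.

```c
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Read clock `id` `iters` times.  Returns how many readings were identical
 * to the previous one (-1 if the clock is unsupported) and stores the
 * number of backwards steps in *backwards. */
long count_repeats(clockid_t id, long iters, long *backwards)
{
    struct timespec prev, cur;
    long equal = 0;

    assert(backwards != NULL);
    *backwards = 0;
    if (clock_gettime(id, &prev) != 0)
        return -1;
    for (long i = 1; i < iters; i++) {
        clock_gettime(id, &cur);
        if (cur.tv_sec == prev.tv_sec && cur.tv_nsec == prev.tv_nsec)
            equal++;
        else if (cur.tv_sec < prev.tv_sec ||
                 (cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec))
            (*backwards)++;
        prev = cur;
    }
    return equal;
}

/* Distinct values seen per second: (total - equal) / elapsed seconds. */
double updates_per_second(long total, long equal, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return (double)(total - equal) / elapsed_s;
}
```

With the figures above, updates_per_second(9079882, 9078198, 3.017763) gives roughly 558, matching the irb result.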


> I switched from using gettimeofday() to CLOCK_REALTIME many years
> ago when syscalls started taking less than 1 usec and still occasionally
> have problems from this running old kernels, because old i386 kernels
> don't support CLOCK_REALTIME and old amd64 kernels have a broken
> CLOCK_REALTIME in 32-bit mode).

Entirely possible that's why things are more expensive on my test  
machine.

% sysctl hw.model
hw.model: AMD Athlon(tm) 64 Processor 3500+
% uname -a
FreeBSD dev2.office.chittenden.org 7.0-RELEASE FreeBSD 7.0-RELEASE #0: Sun Feb 24 10:35:36 UTC 2008     [hidden email]:/usr/obj/usr/src/sys/GENERIC  amd64


>> PS  Is there a reason that time(3) can't be implemented in terms of  
>> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to  
>> time_t's whole second resolution.
>
> I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low
> accuracy timers provided by the get*time() family are accurate enough
> to give the time in seconds.  Unfortunately, they are still broken --
> they are all incoherent relative to nanotime() and some are incoherent
> relative to each other.  CLOCK_SECOND can lag the time in seconds given
> by up to tc_tick/HZ seconds.  This is because CLOCK_SECOND returns the
> time in seconds at the last tc_windup(), so it misses seeing rollovers
> of the second in the interval between the rollover and the next
> tc_windup(), while nanotime() doesn't miss seeing these rollovers so
> it gives incoherent times, with nanotime()/CLOCK_REALTIME being correct
> and time_second/CLOCK_SECOND broken.

Interesting.  Incoherent, but accurate enough?  We're talking about a  
<10ms window of incoherency, right?

>> % ./bench_time 9079882 | sort -rnk1
>> Timing micro-benchmark.  9079882 syscall iterations.
>> Avg. us/call    Elapsed     Name
>> 9.322484    84.647053       gettimeofday(2)
>> 8.955324    81.313291       time(3)
>> 8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
>> 8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
>> 0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
>> 0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
>> 0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
>> 0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
>> 0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)
>
> These are very slow.  Are they on a 486? :-)  I get about 262 ns for
> CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
> The syscall overhead is about 200 nsec (170 nsec for a simpler syscall
> and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading
> the TSC timecounter adds another 60 nsec, including a whole 6 nsec for
> the hardware part of the read (perhaps more like 30 nsec than 60 for the
> whole read).  The TSC doesn't work on all machines (never for SMP), but
> this will hopefully change.  (Phenom is supposed to have TSCs that are
> coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+
> to implement this :-(.  Core2 already has a 40+ cycles rdtsc, but AFAIK
> it doesn't have coherent TSCs.)  Other timecounters are much slower than
> the TSC, but I haven't seen one take 8000 nsec since 486 days.

*shrug*  elapsed / number of calls.  Not doing anything fancy here.

> Some of my benchmark results:

Can I run this same test/see how this was written?

> This system has a fairly fast ACPI and i8254 timecounters.  1500-800
> nsec is more typical for ACPI-fast, and 4000-5000 is more typical
> for i8254.  ACPI-fast should be named ACPI-not-very-slow.  ACPI-safe
> is very slow, perhaps slower than i8254.  i8254 could be made about
> twice as fast if anyone cared.

Hrm.

% sysctl -a | grep -i acpi_timer
machdep.acpi_timer_freq: 3579545
dev.acpi_timer.0.%desc: 24-bit timer at 3.579545MHz
dev.acpi_timer.0.%driver: acpi_timer
dev.acpi_timer.0.%location: unknown
dev.acpi_timer.0.%pnpinfo: unknown
dev.acpi_timer.0.%parent: acpi0
% sysctl -a | grep -i tsc
kern.timecounter.choice: TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000)
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2749242907
kern.timecounter.tc.TSC.frequency: 2222000000
kern.timecounter.tc.TSC.quality: 800
kern.timecounter.smp_tsc: 0
machdep.tsc_freq: 2222000000

> Other implementation bugs (all in clock_getres()):
> - all of the clock ids that use getnanotime() claim a resolution of 1
>  nsec, but that is bogus.  The actual resolution is more like tc_tick/HZ.
>  The extra resolution in a struct timespec is only used to return
>  garbage related to the incoherency of the clocks.  (If it could be
>  arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
>  the clocks would be coherent and the times would always be a multiple
>  of tc_tick/HZ, with no garbage in low bits.)
> - CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
>  bogus.  The actual resolution is more like 1/stathz, or perhaps 1
>  microsecond.  hz is irrelevant here since statclock ticks are used.
>  statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
>  correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
>  calcru() calculates the time to a resolution of 1 usec; it is just
>  very inaccurate at that resolution.
> "Resolution" is a poor term for the functionality needed here.  I think
> a hint about the accuracy is more important.  In simple implementations
> using interrupts and ticks, the accuracy would be about the same as
> the resolution, but FreeBSD is more complicated.

Is there any reason that the garbage resolution can't be zero'ed out  
to indicate confidence of the kernel in the precision of the  
information?  -sc
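
As an illustration of what zeroing the low-order digits would mean, here is a userland sketch (hypothetical helper, not a kernel change) that rounds a timespec down to a multiple of a given resolution. It assumes the resolution divides one second evenly, so only tv_nsec needs adjusting.

```c
#ifndef __FreeBSD__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

/* Round ts down to a multiple of res_ns nanoseconds.  res_ns must divide
 * 1,000,000,000 evenly (for example 1000000 for a 1 ms tick). */
struct timespec truncate_to_resolution(struct timespec ts, long res_ns)
{
    assert(res_ns > 0 && 1000000000L % res_ns == 0);
    ts.tv_nsec -= ts.tv_nsec % res_ns;
    return ts;
}
```

Applied to the CLOCK_REALTIME_FAST value 1212427457.656410126 with a 1 ms resolution, this would yield 1212427457.656000000.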

--
Sean Chittenden
[hidden email]
http://sean.chittenden.org/

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[hidden email]"

Re: Micro-benchmark for various time syscalls...

Sean Chittenden-4
In reply to this post by Bruce Evans-4
>> rozetta~/devel/c%>sysctl hw.model
>> hw.model: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz
>>
>> rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1
>> Timing micro-benchmark.  9079882 syscall iterations.
>> Avg. us/call    Elapsed         Name
>> 1.405469        12.761494       clock_gettime(2/CLOCK_REALTIME)
>> 1.313101        11.922799       time(3)
>> 1.305518        11.853953       clock_gettime(2/CLOCK_MONOTONIC)
>> 1.303947        11.839681       gettimeofday(2)
>> 0.442908        4.021557        clock_gettime(2/CLOCK_PROF)
>> 0.436484        3.963223        clock_gettime(2/CLOCK_VIRTUAL)
>> 0.217718        1.976851        clock_gettime(2/CLOCK_MONOTONIC_FAST)
>> 0.215264        1.954571        clock_gettime(2/CLOCK_REALTIME_FAST)
>> 0.211779        1.922932        clock_gettime(2/CLOCK_SECOND)
>
> These seem about right for a normal untuned ~2GHz system:

This raises the question of tuning for time calls.  Do you have a best
practice that you use for reducing the cost of time calls?  -sc

--
Sean Chittenden
[hidden email]
http://sean.chittenden.org/


Re: Micro-benchmark for various time syscalls...

Gary Stanley-2
In reply to this post by Bruce Evans-4
At 06:55 AM 6/2/2008, Bruce Evans wrote:

>On Mon, 2 Jun 2008, Gary Stanley wrote:
>
>>At 12:54 AM 6/2/2008, Sean Chittenden wrote:
>>>PS  Is there a reason that time(3) can't be implemented in terms of
>>>clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to
>>>time_t's whole second resolution.
>>
>>Another interesting idea is to map gettimeofday() to userland, sort
>>of like darwin (commpage) and linux (vsyscall) via read only page.
>
>time() can reasonably be implemented like that, but not gettimeofday().
>gettimeofday() should have an accuracy of 1 usec and it returns a large
>data structure that cannot be locked by simple atomic accesses.  The
>read-only page would have to be updated millions of times per second
>or take a pagefault to access to give the same functionality as FreeBSD
>gettimeofday().  The updates would cost about 100% of 1 CPU.  Other
>CPUs could then read the time using locking like that in binuptime()
>but more complicated (needs an atomic update for at least the generation
>count, and probably more).  The pagefaults would give a smaller
>pessimization (I guess slightly longer to reach microtime() than via
>the current syscall, and identical time in microtime() to do the update
>on demand).

Here's a sloppy thought :) What about just rewriting gettimeofday in
libc to query the TSC and convert it to usecs etc? That would
eliminate any costly userland -> kernel overhead. I have a proof of
concept here to do this.

The only bad thing is the skewing of the TSC..


Re: Micro-benchmark for various time syscalls...

Gary Stanley-2
In reply to this post by Bruce Evans-4
At 06:19 AM 6/2/2008, Bruce Evans wrote:

>These are very slow.  Are they on a 486? :-)  I get about 262 ns for
>CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
>The syscall overhead is about 200 nsec (170 nsec for a simpler syscall
>and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading
>the TSC timecounter adds another 60 nsec, including a whole 6 nsec for
>the hardware part of the read (perhaps more like 30 nsec than 60 for the
>whole read).  The TSC doesn't work on all machines (never for SMP), but
>this will hopefully change.  (Phenom is supposed to have TSCs that are
>coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+
>to implement this :-(.  Core2 already has a 40+ cycles rdtsc, but AFAIK
>it doesn't have coherent TSCs.)  Other timecounters are much slower than
>the TSC, but I haven't seen one take 8000 nsec since 486 days.

Phenoms don't have TSCs that are coherent, at least on a few machines here:

4 CPUs, running 4 parallel test-tasks.
checking for time-warps via:
- read time stamp counter (RDTSC) instruction (cycle resolution)
- gettimeofday (TOD) syscall (usec resolution)
- clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution)

new TSC-warp maximum: -4294919263 cycles, 00000000ffffe11b -> 0000000000009cbc
new TSC-warp maximum: -4294919300 cycles, 00000000ffff74e4 -> 0000000000003060
  | TSC: 2.24us, fail:3 | TOD: 2.24us, fail:0 | CLK: 2.24us, fail:0 |

The code to test the TSC to check for warping:

http://leaf.dragonflybsd.org/~gary/tests/time-warp-test.c

However, it seems that Core2s don't have any warping of the TSC. I
tested that code on a Core2 Quad for 8 hours with no TSC failures.


Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by Bruce Evans-4
On Mon, 2 Jun 2008, Gary Stanley wrote:

> At 06:55 AM 6/2/2008, Bruce Evans wrote:
>> On Mon, 2 Jun 2008, Gary Stanley wrote:
>>
>>> At 12:54 AM 6/2/2008, Sean Chittenden wrote:
>>>> PS  Is there a reason that time(3) can't be implemented in terms of
>>>> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to
>>>> time_t's whole second resolution.
>>>
>>> Another interesting idea is to map gettimeofday() to userland, sort of
>>> like darwin (commpage) and linux (vsyscall) via read only page.
>>
>> time() can reasonably be implemented like that, but not gettimeofday().
>> gettimeofday() should have an accuracy of 1 usec and it returns a large
>> data structure that cannot be locked by simple atomic accesses...
>
> Here's a sloppy thought :) What about just rewriting gettimeofday in libc to
> query the TSC and convert it to usecs etc? That would eliminate any costly
> userland -> kernel overhead. I have a proof of concept here to do this.

This is hard enough to do in the kernel.  The result is the TSC timecounter,
which is too hard to make work properly (coherently and without interference
from power saving, etc., changing the clock frequency, and on arches that
don't have a TSC, and on arches that have a TSC whose access methods are
spelled differently than on i386...), except on some machines.

> The only bad thing is the skewing of the TSC..

Closer to impossible to handle in userland.

Of course, some userland benchmarks that don't need very precise timing can
just call rdtsc() and depend on the frequency not changing too much while
the benchmark is running.  Process times in the kernel use essentially
this method.

Another complication with using the TSC is that it executes out of order
on many (i386/amd64) CPU types.  So rdtsc's inside short sections of code
don't work right.
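
To make the "convert it to usecs" step concrete, here is a minimal sketch of the arithmetic only. It is not Gary's proof of concept; the function name is hypothetical, the frequency is assumed constant (exactly the assumption that power management and SMP skew break), and reading the counter itself is not shown.

```c
#include <stdint.h>

/* Convert a TSC cycle count to microseconds for an assumed constant
 * frequency.  Splitting into whole seconds and a remainder keeps the
 * intermediate product within 64 bits for any frequency below ~18 THz. */
uint64_t tsc_to_usec(uint64_t cycles, uint64_t freq_hz)
{
    uint64_t sec = cycles / freq_hz;
    uint64_t rem = cycles % freq_hz;

    return sec * 1000000u + rem * 1000000u / freq_hz;
}
```

With the machdep.tsc_freq value reported earlier in the thread (2222000000), tsc_to_usec(2222000000ULL, 2222000000ULL) is one second's worth of cycles and yields 1000000.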

Bruce

Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by Bruce Evans-4
On Mon, 2 Jun 2008, Gary Stanley wrote:

> At 06:19 AM 6/2/2008, Bruce Evans wrote:
>
>> These are very slow.  Are they on a 486? :-)  I get about 262 ns for
>> CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
>> The syscall overhead is about 200 nsec (170 nsec for a simpler syscall
>> and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading
>> the TSC timecounter adds another 60 nsec, including a whole 6 nsec for
>> the hardware part of the read (perhaps more like 30 nsec than 60 for the
>> whole read).  The TSC doesn't work on all machines (never for SMP), but
>> this will hopefully change.  (Phenom is supposed to have TSCs that are
>> coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+
>> to implement this :-(.  Core2 already has a 40+ cycles rdtsc, but AFAIK
>> it doesn't have coherent TSCs.)  Other timecounters are much slower than
>> the TSC, but I haven't seen one take 8000 nsec since 486 days.
>
> Phenoms don't have TSCs that are coherent, at least on a few machines here:

According to the amd64 arch manual (volume 3 3.14 Sep 2007):

If CPUID 8000_0007.edx[8] = 1, then [details about hardware states...]
then the TSC is suitable for use as a source of time.  Google shows
support for this feature in at least Linux and Xen.

Phenom also has a rdtscp instruction which is serializing.
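
A userland check of that CPUID bit could look like the sketch below. It assumes GCC or Clang on x86 (the <cpuid.h> helper header); the function names are hypothetical, and only the bit itself is tested, not the hardware-state conditions elided above.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumption: not MSVC) */
#endif

/* Bit 8 of EDX as returned by CPUID leaf 0x80000007. */
int edx_has_tsc_invariant(uint32_t edx)
{
    return (edx >> 8) & 1;
}

/* 1 if the CPU sets the bit, 0 if it does not, -1 if the leaf is not
 * available or the build target is not x86. */
int tsc_invariant_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return -1;
    return edx_has_tsc_invariant(edx);
#else
    return -1;
#endif
}
```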

> 4 CPUs, running 4 parallel test-tasks.
> checking for time-warps via:
> - read time stamp counter (RDTSC) instruction (cycle resolution)
> - gettimeofday (TOD) syscall (usec resolution)
> - clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution)
>
> new TSC-warp maximum: -4294919263 cycles, 00000000ffffe11b -> 0000000000009cbc
> new TSC-warp maximum: -4294919300 cycles, 00000000ffff74e4 -> 0000000000003060
> | TSC: 2.24us, fail:3 | TOD: 2.24us, fail:0 | CLK: 2.24us, fail:0 |

The difference seems to be only about -0x6000, with an overflow bug in
the test giving a value near -2^32.
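
The arithmetic behind a value near -2^32 can be sketched as follows. This is an illustration under an assumption: that the printed samples are the low 32 bits of a wider counter. The source of time-warp-test.c is not shown in this thread, and the function names are hypothetical.

```c
#include <stdint.h>

/* Difference computed in 64 bits, treating the samples as full values. */
int64_t delta_wide(uint64_t before, uint64_t after)
{
    return (int64_t)after - (int64_t)before;
}

/* Difference modulo 2^32, which is what matters if only the low 32 bits
 * of the counter were sampled. */
int32_t delta_low32(uint64_t before, uint64_t after)
{
    return (int32_t)((uint32_t)after - (uint32_t)before);
}
```

For the first reported pair, delta_wide(0xffffe11b, 0x9cbc) reproduces the printed -4294919263, while delta_low32() of the same pair is 48033 cycles.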

> The code to test the TSC to check for warping:
>
> http://leaf.dragonflybsd.org/~gary/tests/time-warp-test.c

> However, it seems that Core2s don't have any warping of the TSC. I tested
> that code on a Core2 Quad for 8 hours with no TSC failures.

Interesting.  Please check the manual.  I don't have current Intel arch
manuals handy.

Bruce

Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by Sean Chittenden-4
On Mon, 2 Jun 2008, Sean Chittenden wrote:

>>> rozetta~/devel/c%>sysctl hw.model
>>> hw.model: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz
>>>
>>> rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1
>>> Timing micro-benchmark.  9079882 syscall iterations.
>>> Avg. us/call    Elapsed         Name
>>> 1.405469        12.761494       clock_gettime(2/CLOCK_REALTIME)
>>> ...
>>
>> These seem about right for a normal untuned ~2GHz system:
>
> This raises the question of tuning for time calls.  Do you have a best practice
> that you use for reducing the cost of time calls?  -sc

At least try all possible time counters, and choose the one that works best.
Best == fastest and accurate enough.  Best != highest quality according to
kernel hard-coded quality numbers.  ntp will tell you if it isn't accurate
enough if this isn't obvious.

This normally means the TSC on UP systems without power management and
ACPI-fast otherwise.  The kernel quality parameter gives too much
preference to ACPI-fast.

Switching between all possible timecounters at runtime is easier in
not very old versions of FreeBSD.  Old versions didn't even list all
timecounters considered at boot time.  Some timecounters, e.g., HPET
and of course ACPI* on non-ACPI systems are not available even if the
hardware supports them unless they are configured at compile time or
boot time.  It's hard to test the HPET counter on new FreeBSD cluster
machines because it is not configured and it would require privilege
to use if it were configured but not selected.

Bruce

Re: Micro-benchmark for various time syscalls...

Bruce Evans-4
In reply to this post by Sean Chittenden-4
On Mon, 2 Jun 2008, Sean Chittenden wrote:

>> I wouldn't expect SMP to make much difference between CLOCK_REALTIME and
>> CLOCK_REALTIME_FAST.  The only difference is that the former calls
>> nanotime() where the latter calls getnanotime().  nanotime() always does
>> more, but it doesn't have any extra SMP overheads in most cases (in rare
>> cases like i386 using the i8254 timecounter, it needs to lock accesses to
>> the timecounter hardware).  gettimeofday() always does more than
>> CLOCK_REALTIME, but again no more for SMP.
>
> You may be right, I can only speculate.  Going off of phk@'s rhetorical
> questions regarding gettimeofday(2) working across cores/threads, I assumed
> there would be some kind of synchronization.
>
> http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-October/057280.html

The synchronization is all in binuptime().  It is quite delicate.  It
depends mainly on an unlocked, nonatomically-accessed generation count
for software synchronization and the hardware being almost-automatically
synchronized with itself for hardware synchronization.  It takes various
magic for an unlocked, non-atomically accessed generation count to work.
Since it has no locking and executes identical code for SMP and !SMP, it
has identical overheads for SMP and !SMP.  Hardware is almost-automatically
synchronized with itself by using identical hardware for all CPUs.  This
is what breaks down for the TSC on SMP systems (power management may affect
both).  Some hardware timecounters like the i8254 require locking to give
exclusive access to the hardware.
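
The generation-count idea can be sketched in portable C as below. This is not the kernel's binuptime() code: it swaps in C11 atomics and fences where the kernel relies on unlocked, non-atomic accesses plus "various magic", and the struct and function names are hypothetical. A single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

struct th_sketch {
    atomic_uint      gen;   /* 0 means "update in progress" */
    _Atomic uint64_t sec;
    _Atomic uint64_t frac;
};

/* Writer: invalidate the generation, update the data, then publish a new
 * non-zero generation.  Only one writer may run at a time. */
void th_write(struct th_sketch *th, uint64_t sec, uint64_t frac)
{
    unsigned next = atomic_load_explicit(&th->gen, memory_order_relaxed) + 1;

    if (next == 0)          /* skip the reserved value on wraparound */
        next = 1;
    atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&th->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&th->frac, frac, memory_order_relaxed);
    atomic_store_explicit(&th->gen, next, memory_order_release);
}

/* Reader: takes no lock; retries if an update was in progress or the
 * generation changed while the data was being read.  Spins until the
 * first th_write() has completed. */
void th_read(struct th_sketch *th, uint64_t *sec, uint64_t *frac)
{
    unsigned g;

    do {
        g = atomic_load_explicit(&th->gen, memory_order_acquire);
        *sec = atomic_load_explicit(&th->sec, memory_order_relaxed);
        *frac = atomic_load_explicit(&th->frac, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while (g == 0 ||
             g != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

Readers run the same code on SMP and !SMP, which is the property described above; the cost of a conflicting update is a retry rather than a blocked lock.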

>>> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most
>>> authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of
>>> CLOCK_REALTIME's value[2]).  In fact, I'd assume that CLOCK_REALTIME_FAST
>>> is just as accurate as Linux's gettimeofday(2) (a statement I can't back
>>> up, but believe is likely to be correct) and therefore there isn't much
>>> harm (if any) in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive
>>> more widespread use vs. gettimeofday(2).  FYI.  -sc
>>
>> The existence of most of CLOCK_* is a bug.  I wouldn't use
>> CLOCK_REALTIME_FAST for anything (if only because it doesn't exist in
>> most kernels that I run).
>
> I think that's debatable, actually.  I modified my little micro-benchmark

It's debatable, but not with me :-).

> program to test the realtime values returned from each execution and found
> that CLOCK_REALTIME_FAST likely updates itself sufficiently frequently for
> most applications (not all, but most).  My test ensures that time doesn't go
> backwards and tallies the number of times that the values are identical.
> It'd be nice if CLOCK_REALTIME_FAST incremented by a small and reasonable
> fudge factor every time it's invoked so that the values aren't identical.

I would probably go direct to the hardware if doing a large enough number
of measurements for clock granularity or access overheads to matter.
Otherwise, CLOCK_REALTIME or CLOCK_MONOTONIC is best.  These are easy to use
and give the most accurate results possible.

>>> PS  Is there a reason that time(3) can't be implemented in terms of
>>> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to
>>> time_t's whole second resolution.
>>
>> I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low
>> accuracy timers provided by the get*time() family are accurate enough
>> to give the time in seconds.  Unfortunately, they are still broken --
>> they are all incoherent relative to nanotime() and some are incoherent
>> relative to each other.  CLOCK_SECOND can lag the time in seconds given
>> by up to tc_tick/HZ seconds.  This is because CLOCK_SECOND returns the
>> time in seconds at the last tc_windup(), so it misses seeing rollovers
>> of the second in the interval between the rollover and the next
>> tc_windup(), while nanotime() doesn't miss seeing these rollovers so
>> it gives incoherent times, with nanotime()/CLOCK_REALTIME being correct
>> and time_second/CLOCK_SECOND broken.
>
> Interesting.  Incoherent, but accurate enough?  We're talking about a <10ms
> window of incoherency, right?

Yes.  10ms is a lot.  It results in about 1 in every 100 timestamps being
incoherent, so my fs benchmark that tests for file times being coherent
(it actually tests for ctime/mtime/atime updates happening in the correct
order when file times are incoherent with time(1)) doesn't have to run
for very long to find an incoherency.  After rounding the times to a seconds
boundary, the amount of the incoherency is rounded up from 1-10ms to 1
second.  Incoherencies of 1 second persist for the length of the window.
The delicate locking in binuptime() doesn't allow the data structure updates
that would be required to make all the access methods coherent.  Full
locking would probably be required for that.
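
A toy model makes the window visible. The sketch below is not kernel code: it assumes windups at a fixed 10 ms interval with a fixed offset from the second boundary, works in integer microseconds, and uses hypothetical function names. It counts how often a seconds value cached at the last windup lags the precise seconds value.

```c
#include <stdint.h>

/* Time of the most recent windup at or before now_us, for windups that
 * occur every tick_us microseconds starting at offset_us.  Requires
 * now_us >= offset_us. */
int64_t last_windup_us(int64_t now_us, int64_t tick_us, int64_t offset_us)
{
    return now_us - ((now_us - offset_us) % tick_us);
}

/* Count samples in [start_us, end_us), stepping by step_us, at which the
 * cached seconds value lags the precise seconds value. */
long count_lagging(int64_t start_us, int64_t end_us, int64_t step_us,
                   int64_t tick_us, int64_t offset_us)
{
    long lagging = 0;
    int64_t t;

    for (t = start_us; t < end_us; t += step_us) {
        int64_t cached = last_windup_us(t, tick_us, offset_us);

        if (cached / 1000000 != t / 1000000)
            lagging++;
    }
    return lagging;
}
```

With a 3 ms offset, count_lagging(1000000, 101000000, 1000, 10000, 3000) finds 300 lagging samples out of 100000 (0.3%); as the offset approaches the full 10 ms tick the fraction approaches the 1 in 100 mentioned above, and with an offset of 0 (windups aligned to the second boundary) it is zero.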

>> Some of my benchmark results:
>
> Can I run this same test/see how this was written?

It is an old sanity test program by wollman which I've touched as little
as possible, just to convert to CLOCK_REALTIME and to hack around some
bugs involving array overruns which became larger with the larger range
of values in nanoseconds.  He probably doesn't want to see it, but I
will include it here :-).

%%%
#include <sys/types.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#include <math.h>
#include <limits.h>
#include <string.h>

#define N 2000000
#define NHIST (N * 10)

int diffs[N];
int hist[NHIST]; /* XXX various assumptions on diffs */

int main(void) {
   int i, j;
   int min, max;
   double sum, mean, var, sumsq;
   struct timespec tv, otv;

   memset(diffs, '\0', sizeof diffs); /* fault in whole array, we hope */

   /* Record the step (in nsec) each time the clock value changes. */
   for(i = 0; i < N; i++) {
     clock_gettime(CLOCK_REALTIME, &tv);
     do {
       otv = tv;
       clock_gettime(CLOCK_REALTIME, &tv);
     } while(tv.tv_sec == otv.tv_sec && tv.tv_nsec == otv.tv_nsec);
     diffs[i] = tv.tv_nsec - otv.tv_nsec + 1000000000 * (tv.tv_sec - otv.tv_sec);
   }

   min = INT_MAX;
   max = INT_MIN;
   sum = 0;
   sumsq = 0;
   for(i = 0; i < N; i++) {
     if(diffs[i] > max) max = diffs[i];
     if(diffs[i] < min) min = diffs[i];
     sum += diffs[i];
     sumsq += (double)diffs[i] * diffs[i]; /* square in double: no int overflow */
   }

   mean = sum / (double)N;
   var = (sumsq - mean * sum) / (double)N; /* E[x^2] - mean^2 */

   printf("min %d, max %d, mean %f, std %f\n", min, max, mean, sqrt(var));

   /* Histogram of steps; skip values that would index outside hist[]. */
   for(i = 0; i < N; i++) {
     if(diffs[i] >= 0 && diffs[i] < NHIST)
       hist[diffs[i]]++;
   }

   /* Print the five most frequent steps. */
   for(j = 0; j < 5; j++) {
     max = 0;
     min = 0;
     for(i = 0; i < NHIST; i++) {
       if(hist[i] > max) {
         max = hist[i];
         min = i;                /* xxx */
       }
     }
     printf("%dth: %d (%d observations)\n", j + 1, min, max);
     hist[min] = 0;
   }

   return 0;
}
%%%

>> Other implementation bugs (all in clock_getres()):
>> - all of the clock ids that use getnanotime() claim a resolution of 1
>>  nsec, but that is bogus.  The actual resolution is more like tc_tick/HZ.
>>  The extra resolution in a struct timespec is only used to return
>>  garbage related to the incoherency of the clocks.  (If it could be
>>  arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
>>  the clocks would be coherent and the times would always be a multiple
>>  of tc_tick/HZ, with no garbage in low bits.)
>> - CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
>>  bogus.  The actual resolution is more like 1/stathz, or perhaps 1
>>  microsecond.  hz is irrelevant here since statclock ticks are used.
>>  statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
>>  correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
>>  calcru() calculates the time to a resolution of 1 usec; it is just
>>  very inaccurate at that resolution.
>> "Resolution" is a poor term for the functionality needed here.  I think
>> a hint about the accuracy is more important.  In simple implementations
>> using interrupts and ticks, the accuracy would be about the same as
>> the resolution, but FreeBSD is more complicated.
>
> Is there any reason that the garbage resolution can't be zeroed out to
> indicate the kernel's confidence in the precision of the information?  -sc

Well, I only recently decided that "garbage" is the right way to think
of the extra precision.  Some care would be required to not increase
incoherency when discarding the garbage.
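
Mechanically, discarding the low bits amounts to rounding the nanoseconds field down to the clock's real granularity. The sketch below shows only that step in userland terms; the function name is hypothetical, the 10 ms granularity is an assumption, and it does nothing about the coherency concern just raised (a truncated time can still compare oddly against an untruncated clock).

```c
#include <time.h>

/* Round ts down to a multiple of gran_ns nanoseconds.  gran_ns must be
 * positive and is assumed to divide one second evenly (e.g. 10000000 for
 * a 10 ms tick), so no carry into tv_sec is needed. */
void ts_truncate(struct timespec *ts, long gran_ns)
{
    ts->tv_nsec -= ts->tv_nsec % gran_ns;
}
```

For example, a timespec of 5 s + 123456789 ns truncated with gran_ns = 10000000 becomes 5 s + 120000000 ns.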

Bruce