HyperThreading on Intel Xeon Haswell, a benefit?


HyperThreading on Intel Xeon Haswell, a benefit?

grarpamp
HyperThreading on Intel Xeon Haswell, a benefit?

What bits of FreeBSD are aware of Intel HTT and can take proper
advantage of it, such as its thread/process schedulers
(sched-BSD/ULE/...), etc?

What system/app loads are, or are not, likely to benefit from today's
HyperThreading CPUs? Kernel (ZFS/crypto/net/...) vs. userland (apps)?

Does anyone have performance stats for this current class of CPU
to post comparing HT (enabled and disabled) while using more than
four processes/threads in parallel?

For instance, these two Intel Xeon Haswell four-core CPUs are
identical except for HT [1] (e3-1226v3 and e3-1246v3), and you
can always turn HT off for testing.
http://ark.intel.com/compare/80917,80916

There are some Core i3/i5/i7 Haswell parts with HT as well.
http://ark.intel.com/Search/Advanced?s=t&ECCMemory=true&VTD=true&AESTech=true

There don't seem to be many reviews of Xeon processors, let alone
HT. And most Unix talk of HT seems dated by at least a few years
and a couple processor generations.

Also, was the HT cache leak security issue from a decade ago ever
fixed in hardware?
"Cache missing for fun and profit"
http://www.daemonology.net/papers/

Being unsure of the best list, please direct replies to whichever
is good. Thanks.

[1] Plus 200MHz/6% clock per core and $59/27% market price bumps,
but this thread is about whether or not there is any benefit to HT
in current Intel CPUs such as Haswell, how much of one, and where.
Once that is determined, you can factor in other parameters
like these to see if it's an overall value.
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[hidden email]"

Re: HyperThreading on Intel Xeon Haswell, a benefit?

O. Hartmann-4
On Mon, 8 Dec 2014 04:43:05 -0500
grarpamp <[hidden email]> wrote:

> HyperThreading on Intel Xeon Haswell, a benefit?
> ...

Hello.

Well, my experience here is narrow and somewhat naive, so be
warned.

From my experience, mostly compiling FreeBSD sources from scratch
(/usr/obj deleted, no sophisticated caching subsystems used), building
world and kernel with as many parallel jobs as the box has logical
CPUs (the count comes from PARA=`sysctl -n hw.ncpu` and is then passed
as "make -j${PARA} ..."): a dual core, 4-thread CPU
at 3.3 GHz takes ~ 60 minutes to build world, the same as a 4-core
castrated i3 with disabled SMT. Switching off SMT on the dual core
results in roughly 90 - 100 minutes compile time in my case, depending
on the average load of the box while compiling. So, for INTEGER
performance, I see a real benefit from SMT.
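
As a rough check on those build times, the SMT gain can be expressed
as the ratio of the SMT-off time to the SMT-on time. This is only an
illustrative sketch: the helper name "speedup" is made up here, and the
inputs are the approximate 60 and 90 - 100 minute figures quoted above,
not new measurements.

```shell
#!/bin/sh
# speedup T_OFF T_ON
# Prints T_OFF / T_ON, i.e. how many times faster the build is with SMT on.
# "speedup" is a hypothetical helper name used only for this sketch.
speedup() {
    awk -v t_off="$1" -v t_on="$2" 'BEGIN { printf "%.2f\n", t_off / t_on }'
}

# Dual core: ~60 min with SMT, roughly 90 - 100 min without (figures above).
speedup 90 60
speedup 100 60
```

On those figures, SMT makes buildworld on the dual core about 1.5 to
1.67 times faster.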

The picture is somewhat different for floating point performance.
Using SMT in some FPU heavy calculations on Sandy- and Ivy-Bridge CPUs
(Haswell is not available as XEON to me at this very moment), I see
a gain of only 10%, up to a max. of 25% (roughly estimated from some
crude, manually timed calculations!). There is a slight benefit, better
on the more recent Ivy-Bridge than on Sandy-Bridge, and both of these
seem to be superior in that respect to some Westmere 6-core XEONs we
used a couple of years ago (this may be related to architectural
improvements other than SMT, like the ring bus introduced in
Sandy Bridge and improved in Ivy Bridge and maybe Haswell).

In earlier times (pre Sandy-Bridge era) there were cases where it
was beneficial to switch off SMT for heavy FPU load in some
BLAS/LAPACK based benchmark scenarios, but that knowledge is years
old and dates from the older P4 designs and early Core i7. I lost
track of that.

To make it short: I would highly recommend using/purchasing SMT
capable CPUs, since there is a benefit in performance. But in the end
the performance gain has to justify the cost of an SMT capable XEON. As
far as I know, most of the "value" XEONs do have SMT by default.

There are some disadvantages regarding the amount of memory the
kernel has to consume for each core (logical and/or physical) found,
so systems with small amounts of physical RAM (< 8 GiB) could run
into trouble - if I'm not wrong. But for all FreeBSD users considering
ZFS for professional/semiprofessional usage, at least 8 GiB is a must;
otherwise it is ZFS that cripples performance, not SMT.

oh    

Re: HyperThreading on Intel Xeon Haswell, a benefit?

Adrian Chadd-2
I've done some basic experimenting with SMT on network loads. For the
most part, as long as you don't fill up one of the ports on the
execution engine that's doing SMT, you're okay.

I've found that a memcpy heavy load (read: normal, non-zero copy
network traffic) brings SMT threads to their knees. A pair of threads
gets as much work done in normal UDP transmit/receive as a single
non-SMT thread. It looks like it's because the ports doing memory
input/output are full and there's not really any other work that's
being done.

I think Haswell still only has one store data port per core. :(



-adrian


On 8 December 2014 at 06:39, O. Hartmann <[hidden email]> wrote:

> ...

Re: HyperThreading on Intel Xeon Haswell, a benefit?

grarpamp
> Ohartmann:
> From my experience, mostly compiling FreeBSD sources from scratch
> ...
> a dual core, 4-thread CPU
> at 3.3 GHz takes ~ 60 minutes to build world, the same as a 4-core
> castrated i3 with disabled SMT. Switching off SMT on the dual core
> ...
> Using SMT in some FPU heavy calculations on Sandy- and Ivy-Bridge CPUs
> (Haswell is not available as XEON to me at this very moment), I see

> Adrian:
> I've done some basic experimenting with SMT on network loads.
> ...
> I've found that a memcpy heavy load (read: normal, non-zero copy


Ohartmann, Adrian...
Good introductory info.
What were your CPU models / lines / sSpec numbers above?
Anyone else?



Expanding...

This evaluation should not be strictly confined to Intel. After
all, AMD has CMT, which is similar to HTT (not clear whether it's
on the Opteron, FX or APU lines). Though it will probably be 2016 before
AMD really capitalizes and shines on their full architecture vision.
By then Intel will just shift a few gears to match. So we should
probably stay on the subject of Intel HTT for now.

http://wccftech.com/amds-high-performance-processor-cores-coming-2015-giving-modular-architecture/
http://en.wikipedia.org/wiki/Simultaneous_multithreading
http://en.wikipedia.org/wiki/Hyper-threading
http://forums.anandtech.com/showthread.php?t=2381524

My thought is that the available evaluations of SMT are all 'old'...
discontinued processors, old compilers, old schedulers, etc, all
dating back to the Intel P4 arch. So let's bring this current in
terms of today's Intel Haswell and AMD APU/FX processors,
with new tests and community data. (Opteron is still on an even
'older' architecture [refresh] compared to FX and APU.)

http://anandtech.com/show/8742/amd-announces-carrizo-and-carrizol-next-gen-apus-for-h1-2015
http://wccftech.com/amd-berlin-server-apu-glimpse-upcoming-kaveri-apu-4-steamroller-cores-512-gcn-sps/

Re: HyperThreading on Intel Xeon Haswell, a benefit?

Jia-Shiun Li
On Tue, Dec 9, 2014 at 12:15 AM, Adrian Chadd <[hidden email]> wrote:
> I've found that a memcpy heavy load (read: normal, non-zero copy
> network traffic) brings SMT threads to their knees. A pair of threads
> gets as much work done in normal UDP transmit/receive as a single
> non-SMT thread. It looks like it's because the ports doing memory
> input/output are full and there's not really any other work that's
> being done.
>
> I think haswell still only has one store data port per core. :(

Yes, Haswell has an additional store addr but still only one store data unit.

http://www.tomshardware.com/reviews/core-i7-4770k-haswell-review,3521.html

But I guess they'd argue that the intent is to saturate the memory
channels with all available cores first, as far as possible, and that
additional threads are only a last resort. And that's probably what
most schedulers do.

I benchmarked it on a 4th gen i3. Buildkernel got a 5~10% benefit IIRC.
The best way to tell is still to conduct tests with your own workload.
If the claimed 5% transistor cost brings 10% benefits, that's already
a win. OTOH how much you paid for it is another story.


- Jia-Shiun.

Re: HyperThreading on Intel Xeon Haswell, a benefit?

grarpamp
On Wed, Dec 10, 2014 at 8:19 AM, Jia-Shiun Li <[hidden email]> wrote:

> Yes, Haswell has an additional store addr but still only one store data unit.
>
> http://www.tomshardware.com/reviews/core-i7-4770k-haswell-review,3521.html
>
> But I guess they'd argue that the intent is to saturate the memory
> channels with all available cores first, as far as possible, and that
> additional threads are only a last resort. And that's probably what
> most schedulers do.
>
> I benchmarked it on a 4th gen i3. Buildkernel got a 5~10% benefit IIRC.
> The best way to tell is still to conduct tests with your own workload.
> If the claimed 5% transistor cost brings 10% benefits, that's already
> a win. OTOH how much you paid for it is another story.

Where is the claim of "5% transistor cost" from?
I don't see it linked in this thread.
Is it in terms of $ as a sales feature to get HT/SMT, or the transistor
count needed to get it? I think the SMT transistor count could change
as CPU generations are optimized.

Any bump in price to get HT is amortized over time.
Any bump in performance due to HT is integrated over time.
A watt costs about $1/yr.
If SMT is 5% faster, then over 4hr it saves 12 minutes of your time,
which saves $n/day, which more than pays for purchase and watts.
If it is slower, it hurts just as $hard unless you turn it off
and eat the purchase difference.

Hence the question, to see what people are seeing perf-wise.
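
The amortization argument can be put into numbers. A minimal sketch
with assumptions stated up front: the helper names are made up for
illustration, the $59 price bump is the figure from footnote [1] of the
first post, the $30/hour value of time is purely hypothetical, "5%
faster" is read as 5% of the run time saved (as in the 12 minute figure
above), and power cost is ignored.

```shell
#!/bin/sh
# minutes_saved HOURS PERCENT
# Minutes saved per HOURS of work if PERCENT of the run time is saved.
minutes_saved() {
    awk -v h="$1" -v p="$2" 'BEGIN { printf "%.0f\n", h * 60 * p / 100 }'
}

# breakeven_days PRICE_BUMP MINUTES_PER_DAY HOURLY_RATE
# Days until the saved time is worth the extra purchase price.
# HOURLY_RATE is a hypothetical value of your time in $/hour.
breakeven_days() {
    awk -v c="$1" -v m="$2" -v r="$3" \
        'BEGIN { printf "%.1f\n", c / (m / 60 * r) }'
}

minutes_saved 4 5          # 4 hours of work per day at a 5% gain
breakeven_days 59 12 30    # $59 bump, 12 min/day saved, $30/hour (assumed)
```

With those inputs the 12 minutes per day match the estimate above, and
the $59 difference is recovered in about 10 days. Plug in your own
rate and measured gain; a negative gain turns the same sum into a cost.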

Re: HyperThreading on Intel Xeon Haswell, a benefit?

Jia-Shiun Li
On Mon, Dec 15, 2014 at 1:40 PM, grarpamp <[hidden email]> wrote:
> Hence the question, to see what people are seeing perf-wise.

Is HTT good for *some* workloads? Definitely yes.
Is HTT good for yours? It depends.
It is not a solution that boosts everything.

You really need your own evaluation methods for your own real world workload.
See if this extreme case motivates you.

Script started on Fri Dec 19 04:11:36 2014
root@:~ # uname -a
FreeBSD  11.0-CURRENT FreeBSD 11.0-CURRENT #0 r275582: Sun Dec  7
22:29:51 UTC 2014
[hidden email]:/usr/obj/usr/src/sys/GENERIC  amd64
root@:~ # sysctl hw.model
hw.model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
root@:~ # sysctl dev.cpu.0.freq
dev.cpu.0.freq: 3600
root@:~ # openssl speed -evp aes-256-cbc
(...)
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc     542270.57k   570008.23k   577700.69k   579443.71k   578661.43k
root@:~ # openssl speed -evp aes-256-cbc -multi 4
(...)
evp            2168111.69k  2283320.41k  2309259.69k  2314799.72k  2323428.69k
root@:~ # openssl speed -evp aes-256-cbc -multi 8
(...)
evp            3720872.65k  4373485.85k  4564089.08k  4615834.28k  4621740.71k
root@:~ # openssl speed -evp aes-256-gcm
(...)
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     372850.59k   941017.15k  1402284.69k  1518668.74k  1552422.23k
root@:~ # openssl speed -evp aes-256-gcm -multi 4
(...)
evp            1492887.94k  3132772.74k  4501002.29k  4929483.52k  5101510.17k
root@:~ # openssl speed -evp aes-256-gcm -multi 8
(...)
evp            1924978.05k  4533256.96k  6764018.70k  7538217.64k  7985778.30k
root@:~ # openssl speed -evp aes-256-ctr
(...)
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-ctr     491349.11k  1550444.11k  2372213.47k  2755245.59k  2879939.33k
root@:~ # openssl speed -evp aes-256-ctr -multi 4
(...)
evp            1871084.37k  4992105.40k  7707692.29k 10242874.37k 10744955.05k
root@:~ # openssl speed -evp aes-256-ctr -multi 8
(...)
evp            2678304.52k  7575305.94k 11861913.17k 12971657.56k 13356457.98k
root@:~ # ^D  exit

Script done on Fri Dec 19 04:16:08 2014



-Jia-Shiun.
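
One way to read the transcript is to compare the 8192-byte throughput
of the "-multi 8" and "-multi 4" runs for each cipher. The sketch below
does only that division, using the numbers printed above. It assumes
HTT was enabled during the run (the transcript does not say so) and
that the four processes of "-multi 4" each landed on a separate
physical core of the 4-core/8-thread i7-4790; the helper name "scale"
is made up for illustration.

```shell
#!/bin/sh
# scale MULTI4 MULTI8
# Prints MULTI8 / MULTI4: throughput gained by going from 4 to 8 processes.
# "scale" is a hypothetical helper name used only for this sketch.
scale() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

# 8192-byte columns (in 1000s of bytes per second) from the transcript.
echo "aes-256-cbc $(scale 2323428.69 4621740.71)"
echo "aes-256-gcm $(scale 5101510.17 7985778.30)"
echo "aes-256-ctr $(scale 10744955.05 13356457.98)"
```

That gives roughly 1.99x for CBC, 1.57x for GCM and 1.24x for CTR.
These ratios are not a substitute for a run with HTT switched off,
since turbo clocks and scheduler placement also differ between the
two cases.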

Re: HyperThreading on Intel Xeon Haswell, a benefit?

grarpamp
On Thu, Dec 18, 2014 at 11:44 PM, Jia-Shiun Li <[hidden email]> wrote:
> You really need your own evaluation methods for your own real world workload.

Of course. Yet many users here might like to compare common things:
o make buildworld times
o iozone if disks are not the limiting factor
o openssl speed as below
o etc

> See if this extreme case motivates you.

Not much since it provides no information regarding HTT.
It doesn't, at minimum:
o  State whether HTT was on or off.
o  Show the results of two runs, one with HTT on, one with HTT off.
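
One way to cover both points is to log the logical CPU count next to
every timing, then repeat the same runs after toggling HTT in the BIOS.
A minimal sketch, assuming a POSIX sh: "timed_runs" is a hypothetical
helper name, hw.ncpu is the same sysctl used earlier in this thread,
and the whole-second resolution of date(1) is only adequate for long
jobs such as buildworld.

```shell
#!/bin/sh
# timed_runs LABEL COUNT COMMAND [ARGS...]
# Runs COMMAND COUNT times and prints one line per run with the label,
# the logical CPU count and the wall-clock time in whole seconds.
# "timed_runs" is a hypothetical helper name used only for this sketch.
timed_runs() {
    label=$1
    count=$2
    shift 2
    # hw.ncpu is the FreeBSD logical CPU count; elsewhere print "unknown".
    ncpu=$(sysctl -n hw.ncpu 2>/dev/null || echo unknown)
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "$label run=$i ncpu=$ncpu seconds=$((end - start))"
        i=$((i + 1))
    done
}

# Harmless demonstration; replace "sleep 1" with the real workload.
timed_runs demo 2 sleep 1
```

For a real comparison, something like "timed_runs htt-on 3 openssl
speed -evp aes-256-cbc -multi 8" would be run with HTT enabled, and the
same command with the label htt-off after disabling it. On a
4-core/8-thread part the ncpu field should then read 8 and 4
respectively, which records the HTT state alongside each result.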