Turning off Hyper-Threading in 6-core Intel Xeon
We got a 12-core Mac Pro to do some Monte Carlo calculations. Its Intel Xeon processors have Hyper-Threading (HT) enabled, so in fact there should be 24 processes running in parallel to make them fully utilized. However, our calcs are more efficient to run on 12x100% than 24x50%, so we tried to turn Hyper-Threading off via the Processor pane in System Preferences in order to get higher performance. One can also turn HT off by
hwprefs -v cpu_ht=false
Then we ran some tests and here is what we got:
- 12 parallel tasks run the same time w/ or w/o HT to our disappointment.
- 24 parallel tasks lose 20% if HT is off (not 50% as we thought)
- When HT is on, switching from 24 to 12 tasks decreases efficiency by 20% (also surprising)
- When HT is off, switching from 24 to 12 doesn't change anything.
It seems that Hyper-Threading just decreases performance for our calculations and there is no way to avoid it. The program we use for the calcs is written in Fortran and compiled with gfortran. Is there a way to make it more efficient with this piece of hardware?
Update: Our Monte Carlo calculations (MCC) are typically done in steps to avoid data loss and for other reasons (it's not always possible to avoid such steps). In our case each step consists of many simulations of variable duration. Since each step is split between a number of parallel tasks, the tasks also have variable duration. Essentially, all faster tasks have to wait until the slowest is done. This forces us to make bigger steps, which finish with less deviation in time due to averaging, so processors do not waste their time waiting. This is our motivation for having 12*2.66 GHz instead of 24*1.33 GHz. If it were possible to turn HT off, then we would get about +10% performance by switching from 24 tasks w/ HT to 12 tasks w/o HT. However, the tests show that we lose 20%. So my conclusion is that the calculation is about 30% less efficient than we expected.
For the tests I used quite large steps; however, usually steps are shorter, so efficiency gets even worse.
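To illustrate the waiting effect described above, here is a rough sketch in Python. It is not our actual code: the exponential distribution of simulation durations, the function name step_efficiency and all parameter values are assumptions chosen only for illustration. It estimates what fraction of a step the tasks spend working rather than waiting for the slowest one.

```python
import random


def step_efficiency(n_tasks, sims_per_task, n_steps=50, seed=0):
    """Average fraction of a step that tasks spend working, not waiting.

    Assumption: simulation durations are exponentially distributed with
    mean 1.0. The real distribution is unknown.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_steps):
        # Each task's run time is the sum of its simulation durations.
        task_times = [
            sum(rng.expovariate(1.0) for _ in range(sims_per_task))
            for _ in range(n_tasks)
        ]
        # The step only ends when the slowest task is done, so the useful
        # fraction is the mean task time divided by the longest task time.
        total += (sum(task_times) / n_tasks) / max(task_times)
    return total / n_steps


if __name__ == "__main__":
    for sims in (10, 100, 1000):
        print(sims, round(step_efficiency(12, sims), 3))
```

Under these assumptions, bigger steps (more simulations per task) bring the value closer to 1, which is the averaging effect mentioned above.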
There is one more reason - some of our calculations require 3-5 GB of memory, so you probably see how economical it would be for us to have 12 fast tasks. We are working on implementing shared memory, but it's going to be a looong term project. Therefore we need to find out how to make the existing hardware/software as fast as possible.
This is more of an extended comment than an answer:
I don't find your observations terrifically surprising. Hyper-threading is a poor man's approach to parallelisation: it allows you to have 2 pipelines of pending instructions on one CPU. But it doesn't provide extra floating-point or integer arithmetic units or more registers; when one pipeline is unable to feed the ALU (or whatever it's called these days) the other pipeline is activated within a clock cycle or two. This contrasts with the situation on a CPU without hyperthreading where, when the instruction pipeline stalls, it has to be flushed and refilled with instructions from another process before the CPU gets back up to speed.
The Wikipedia article on hyperthreading explains all this rather well.
If you are running loads in which pipeline stalls are perfectly synchronised and represent a major part of the total execution time of your program mix, then you might double the speed of a program by going from an unhyperthreaded processor to a hyperthreaded processor.
IF (that's a big if) you could write a program which never stalled in the instruction pipeline then hyperthreading would provide no benefit (in terms of execution acceleration) whatsoever. What you have measured is not a speedup due to HT (well, it is a speedup due to HT but you don't actually want that) but the failure of your threads to keep the pipeline moving.
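The two extremes above can be put into a deliberately crude toy model. Everything in it is an assumption made for illustration, not a description of how the hardware really schedules instructions: each thread is stalled for a fixed fraction of the time, the second hardware thread fills those stalls perfectly, and the function name ht_speedup is made up.

```python
def ht_speedup(stall_fraction):
    """Toy estimate of the throughput gain from running 2 threads per core.

    Assumptions (illustrative only): a single thread keeps the core busy
    for (1 - stall_fraction) of the time, and with hyperthreading the
    second thread fills the stalls perfectly, up to a fully busy core.
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    busy_one_thread = 1.0 - stall_fraction
    busy_two_threads = min(1.0, 2.0 * busy_one_thread)
    return busy_two_threads / busy_one_thread


if __name__ == "__main__":
    for s in (0.0, 0.2, 0.5):
        print(s, ht_speedup(s))
```

In this model a program that never stalls gains nothing, one that stalls half the time doubles its throughput, and a gain of about 1.25 would correspond to stalling roughly a fifth of the time. Real processors are far messier, so treat the numbers as a way of thinking about it rather than a measurement.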
What you have to do is actually decrease the speedup due to HT! Or, rather, you have to increase the execution rate of the 12 processes (one per core) by keeping the pipeline filled. Personally, I'd switch off hyperthreading while I optimised the program's execution on 12 cores.
Have fun.
I'm having a bit of difficulty understanding your description of the benchmarks.
Let's define 100% to be the amount of work you manage to get done with 12 tasks and HT off. And if you were able to get twice as much done in the same period of time, we would call it 200%. So, what are the numbers that you would put in the other three boxes?
Edit: Updated with your numbers.
            without HT    with HT
12 tasks    100%          100%
24 tasks    100%          125%
So, my understanding is that with HT disabled, there are gaps of time while your threads are basically paused (such as when they are waiting for data from memory or from disk), so they don't actually run at 2.66 GHz, but a bit less. With hyperthreading enabled, the CPU switches tasks instead of pausing for these momentary gaps, so the total amount of processing power being used goes up.
Well, that means that with HT on, switching from 12 tasks to 24 tasks increases efficiency by 25% (the same gap as the 20% you lose going from 24 back to 12)! Good benchmarking!
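The percentages depend on which number you take as the baseline, which is easy to trip over. A small sketch using the throughput figures from the table above (the helper name relative_change is just for illustration):

```python
def relative_change(old, new):
    """Relative change when going from old to new, as a fraction of old."""
    return (new - old) / old


# Throughput with HT on, in percent of the "12 tasks, HT off" baseline.
TASKS_12_HT_ON = 100.0
TASKS_24_HT_ON = 125.0

if __name__ == "__main__":
    # Going from 12 to 24 tasks: measured against 100.
    print(relative_change(TASKS_12_HT_ON, TASKS_24_HT_ON))
    # Going from 24 to 12 tasks: measured against 125.
    print(relative_change(TASKS_24_HT_ON, TASKS_12_HT_ON))
```

So a 20% loss in one direction and a 25% gain in the other describe the same pair of measurements.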
On the other hand, if your program is written so that each thread can only work on a separate task (as opposed to being able to split a single task into smaller chunks and proceed concurrently), then for the purpose of reducing the latency for each task (from start to finish) you simply need to limit the number of threads to 12 in software. The hardware HT switch can remain in either position.
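One way to cap the number of simultaneous tasks in software, sketched in Python as a launcher around independent processes. This assumes each task is a separate executable run; the function name run_limited and the placeholder commands are made up for the example, and your real Fortran binary and its arguments would go in their place.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_limited(commands, max_parallel=12):
    """Run each command as its own process, at most max_parallel at once.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(argv):
        return subprocess.run(argv).returncode

    # Each worker thread just blocks on one child process, so the pool size
    # caps how many child processes are alive at the same time.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Placeholder jobs: 24 trivial Python processes standing in for real runs.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(24)]
    print(run_limited(jobs, max_parallel=12))
```

With max_parallel=12 the 24 jobs run in two waves of at most 12, regardless of how the hardware HT switch is set.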
See this posting for an app in Xcode tools to enable / disable hyperthreading (and number of CPUs active). The setting does NOT persist across sleep or reboot: http://www.logicprohelp.com/forum/viewtopic.php?f=5&t=88835
(You run the Instruments app, cancel the initial screen, and then change the CPU Preferences).