Message boards : Number crunching : Linux & Hyperthreading
Hi again
I've spent this afternoon exploring the effects of HT (Intel Hyperthreading) on short sieving apps.
The reputation amongst gamers is that you don't have to disable HT in Linux, if you keep the CPU occupancy to 50% the Linux Kernel does the sensible thing. Is this true for this Project (more exactly for the ESP app)?
The short, approximate answer is yes, it is more or less true.
Before I describe the results I'll just digress to describe how to get Linux to tell you whether you have HT active. There are two ways, neither as easy as looking for a tick in a box, I'm afraid.
Method A.
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
produces output like this if you are hyperthreaded
0,2
1,3
0,2
1,3
and like this if you are not
0
1
Notice there is one line of output for each virtual cpu, so the number of lines is the first hint. More importantly, each line shows a pair of virtual cpus that are actually the same physical core.
Method B.
cat /proc/cpuinfo | less
produces long output. For any of the cpus reported, compare the number of siblings to the number of cores. If the numbers match, you are not using HT; if siblings is bigger, you are.
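A third way, if your distro ships lscpu (part of util-linux) - the exact field names can vary a little between versions, so treat this as a sketch rather than gospel:
lscpu | grep -E '^(Thread|Core|Socket)'
produces something like
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
and "Thread(s) per core: 2" tells you HT is active.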
I was using this recent i7-U machine, which unusually for an i7 has just two cores despite having HT.
Tests were done using all cpus in BOINC settings with HT on, HT off, and with the kernel hobbled to a single thread (no SMP). A final test was done with HT enabled but BOINC set to use only 50% of cpus.
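(As an aside: if you'd rather not reboot into the BIOS to toggle HT, it should also be possible to take the second sibling of each core offline at runtime through sysfs. On this box cpu2 and cpu3 are the HT siblings of cpu0 and cpu1, per Method A above - check your own thread_siblings_list before copying, and note I haven't re-run the benchmark this way, so it's only a sketch:
echo 0 | sudo tee /sys/devices/system/cpu/cpu2/online
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online
Writing 1 back to the same files brings them online again.)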
These are the run times, rounded to nearest sec
ncpu (BOINC)  SMP?  HT  elapsed time  cpu time
all           Y     Y   2203          2174
                        2192          2166
                        2192          2164
                        2189          2163
all           Y     N   1342          1315
                        1344          1311
2             Y     Y   1331          1327
                        1330          1327
all           N     -   1339          1300
The above shows that I got about half as much work again out of HT.
More interesting, I thought, was that it is slightly more efficient in terms of run duration to leave HT enabled and simply cut down the number of cores than to disable HT. The opposite is true if we want to optimise on the basis of cpu time. The reason seems clear: with HT enabled, when the kernel or boinc_client or boinc_manager wants to run, it gets scheduled onto a spare virtual core, so the main tasks are only slowed slightly in elapsed time, but the contention within the physical core extends their cpu time.
The cpu time when running as a single-core machine is the minimum, but interestingly the same HT effect means that the duration is barely different with 1 or 2 cores active.
So the practical advice from this short experiment is to leave HT enabled. If the quick turnaround of a WU is important, limit the number of CPUs using BOINC rather than disabling HT.
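(For anyone who hasn't fiddled with that before: the limit can be set on the project preferences web page, or locally in a global_prefs_override.xml next to the client's data files - /var/lib/boinc-client is the usual spot on Debian-style installs, but check yours. A minimal sketch, with just the one setting:
<global_preferences>
<max_ncpus_pct>50.0</max_ncpus_pct>
</global_preferences>
then boinccmd --read_global_prefs_override makes the client pick it up without a restart.)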
This advice may vary for longer units - and note especially the advice on the prefs page for the longest Genefer app.
R~~
PS for those who remember another recent thread of mine, note that this is a different machine entirely - and just a tad faster ;)
mackerel Volunteer tester
You're like the Linux compatible version of me :)
Was this testing for single units or did you run and average multiples? The possible advantage from 2 threads HT on vs HT off is ~1% as presented. I've tested many cases under Windows, but not that one.
Under almost all situations, the general advice is to run sieve on all threads with HT on, on 64-bit OS for maximum throughput. The only possible exception might be at the end of a challenge if you can get half the units done faster to make a deadline, but that's a specific niche case.
On the other hand, LLR does not gain from HT so only run one per actual core - well, it gains nothing beyond a percent or two of possible measurement error. Under Windows I found that running one LLR per real core with HT on was ~10% less efficient in throughput than running all cores with HT off. I suspect there is something inefficient in how Windows loads resources in that situation, and I saw it on multiple systems when testing. I only found this while I was messing around with Mint (I think) at the time, and the scheduler there wasn't doing sane things either, ending up running multiple threads per core with others sitting idle, resulting in a massive performance penalty. I had to resort to turning off HT as a quick fix, since messing around with affinity is beyond my skills, and that exposed the performance gap against the Windows systems which still had HT on. The Windows systems similarly gained by having HT off, at which point the OSes were equal to each other. While genefer isn't specifically covered, I believe it would behave like LLR rather than sieve, due to the instruction types used.
Having written the above, something doesn't quite add up. I'm pretty sure that in separate testing with HT on, I've seen running 2 threads per core simply make things take twice as long as running 1 thread per core, i.e. no throughput benefit. That must not have included the benefit from turning HT off, but intuitively that doesn't sound right... I doubt my past notes are detailed enough to clarify it adequately, and I'm not likely to test that again any time soon, so I'll leave it as an open question. For now, best LLR performance is either HT off, or if HT is on, manually set affinity to prevent it from ever running two LLR threads on the same core.
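(On the Linux side, taskset from util-linux is probably the least painful way to do that, though I haven't tried it myself. Something like the following should pin two already-running LLR processes to separate physical cores, using the 0,2 / 1,3 sibling pairs from the first post - the PIDs are just placeholders:
taskset -p -c 0 12345
taskset -p -c 1 12346
One taskset call per LLR process, each on a different physical core.)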
|
You're like the Linux compatible version of me :)
;}
Was this testing for single units or did you run and average multiples?
These are the times for single units, running together in each group.
The run times seem so uniform within each group that I decided there was no need to average.
The possible advantage from 2 threads HT on vs HT off is ~1% as presented. I've tested many cases under Windows, but not that one.
Yes, it is very small. For me the biggest advantage is knowing that the impact of "other work" on PG throughput is less with HT on - things like surfing or word processing that I actually bought the machine for! I'd actually accept a small performance hit for that benefit, so it's nice to know that it's not actually a hit at all (in elapsed time).
Under almost all situations, the general advice is to run sieve on all threads with HT on, on 64-bit OS for maximum throughput. .... On the other hand, LLR does not gain from HT so only run one per actual core.
I may (or may not) get around to testing that sometime ;)
|
On the other hand, this AMD laptop may perform differently. AMD doesn't do hyperthreading as such. For integer work each of the four cores is a genuine physical core.
For float, pairs of cores share hardware. This made sense to the AMD designers, as the floating-point units take up much more space on the die.
So the prediction might be that four cores EACH run as fast as two if it's integer work, or if one thread is doing float and the other integer. Equally, theory suggests four cores run JOINTLY only as fast as two if it's mainly float on both threads at the same time.
There is no way to disable this effect on an AMD cpu, as far as I know, apart from simply using fewer of the cores, either via BOINC settings or with the maxcpus=2 directive to the kernel at boot. I only tried the BOINC settings, again because this would be my preferred way of working, leaving the extra cores available for casual work.
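(For completeness, since I didn't go that route: on a Debian-style GRUB setup the maxcpus=2 option would mean adding it to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, something like
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash maxcpus=2"
keeping whatever options are already there - "quiet splash" is just the usual default - then sudo update-grub and a reboot. Untested by me, so only a sketch.)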
Sticking again with the ESP sieving units (Sierpinski Problem ESP/PSP/SoB (Sieve) v1.12) let's see what we got. Again these are run times for single tasks, running together in the groups. This time I did two pairs of tasks when Ncpu=2.
Ncpus  elapsed  cpu
2      4389     4363
       4383     4361
2      4348     4345
       4343     4340
4      7261     6065
       7268     6070
       7256     6061
       7271     6075
Here, in terms of elapsed time, it takes about 8700 sec to do four tasks two at a time, but only about 7300 sec when they all run at once - a saving of almost 20%.
The most striking thing, comparing AMD with Intel, is that the AMD throughput is only about 25% of the Intel. An unfair comparison, as the AMD chip is a generation behind the Intel.
Ignoring that, the next thing that jumps out of the figures is that the AMD shows a much larger discrepancy between elapsed and cpu time when running on all virtual cores. This makes me think that maybe the effect is nothing to do with the shared FPUs but rather to do with cache or memory access. It isn't swap, as the swap partition was available but unused throughout.
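(If I wanted to confirm that, perf from the linux-tools package can count cache misses for a running task - something along these lines, where 12345 stands in for the PID of a sieve task; the available event names vary by CPU, so this is just a sketch:
perf stat -e cache-references,cache-misses -p 12345 sleep 60
A much worse miss ratio on the all-cores run than on the two-core run would point at cache/memory rather than the FPUs.)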
@mackerel: you suggested to run on all cores, except when rushing to meet a deadline when running sieving tasks. Whatever the cause of the effect, your advice seems good here too :)
R~~
mackerel Volunteer tester
@mackerel: you suggested to run on all cores, except when rushing to meet a deadline when running sieving tasks. Whatever the cause of the effect, your advice seems good here too :)
I can't take credit for it, as the advice is generally known since even before I started here...
The performance difference between AMD and Intel is another story. While I think I understand the floating point situation with regard to LLR, integer performance for sieve is a total unknown to me. I would have guessed AMD to be much stronger there.
|
Hi !
I was wondering too whether I have to turn off HT or not.
Because of heat, here are my options:
#1 With HT enabled, I can run 5 WU at the same time @3.4GHz each (turbo mode disabled)
#2 With HT disabled, I can run 3 WU at the same time @3.8GHz each (turbo mode enabled)
What's the best option to take?
Right now I'm going with option #2 (because WUs finish faster), but I'm wondering whether, over long hours of crunching, more work would actually get done with option #1.
Any advice?
Thanks.
Rafael Volunteer tester
Hi !
I was wondering too whether I have to turn off HT or not.
Because of heat, here are my options:
#1 With HT enabled, I can run 5 WU at the same time @3.4GHz each (turbo mode disabled)
#2 With HT disabled, I can run 3 WU at the same time @3.8GHz each (turbo mode enabled)
What's the best option to take?
Right now I'm going with option #2 (because WUs finish faster), but I'm wondering whether, over long hours of crunching, more work would actually get done with option #1.
Any advice?
Thanks.
If you want to run LLR, always turn it off. It only does harm, due to a serious lack of cache and RAM bandwidth.
|
But LLR is not the main goal of my computer; I'm running several projects and several types of WU (LLR / CPU intensive, but also others not affected by HT).
With HT enabled each WU takes more time to complete, but 5 WU run at the same time. Comparing 5 at a time @3.4GHz against 3 at a time @3.8GHz, overall more work gets done with 5 at a time, so I'll go with option #1.
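(A rough back-of-envelope, purely as an illustration: option #2 tops out at 3 x 3.8 = 11.4 GHz worth of threads, while option #1 offers 5 x 3.4 = 17 GHz before any HT contention, and even if the threads sharing a core only deliver the roughly 1.5x seen earlier in this thread, the total stays comfortably above 11.4 - for sieve-like work at least.)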
And if I only have LLR WU to do, I'll switch to option #2 :).
|
Well, I've done a couple of tests. HT does NOT really run 5 WU at the same time (the progress percentages keep pausing somehow).
5 WU on 3 physical cores means no WU ever gets 100% of a CPU.
|
Sounds like you may have mixed up the BOINC setting to use less than 100% of CPU time with the one I think you actually want, which is to use less than 100% of the CPUs.
|
No, I set 100% CPU time but 80% of multiprocessors, resulting in 6 of 8 virtual CPUs being used.